Supplementary Material: 3D Copy-Paste: Physically Plausible Object Insertion for Monocular 3D Detection
Appendix A Experiments on more Monocular 3D Object Detection methods
In our main paper, we use ImVoxelNet (rukhovich2022imvoxelnet) for monocular 3D object detection. To show the robustness of 3D Copy-Paste across different downstream detection methods, we conducted additional experiments with another monocular 3D object detection model: Implicit3DUnderstanding (Im3D (zhang2021holistic)). The Im3D model predicts object 3D shapes, bounding boxes, and scene layout within a unified pipeline. Training this model requires not only the SUN RGB-D dataset but also the Pix3D dataset (sun2018pix3d), which supplies 3D mesh supervision. The Im3D training process consists of two stages. In stage one, the individual modules (the Layout Estimation Network, Object Detection Network, Local Implicit Embedding Network, and Scene Graph Convolutional Network) are pretrained separately. In stage two, all of these modules are trained jointly. We apply our 3D Copy-Paste method only during this second stage of joint training, and only to the 10 SUN RGB-D categories used in the main paper. We implemented our experiment following the official Im3D guidelines (https://github.com/chengzhag/Implicit3DUnderstanding).
Table 1 reports the Im3D results for monocular 3D object detection on the SUN RGB-D dataset, using the same ten categories as in the main paper. Im3D without insertion attains a mean average precision (mAP) of 42.13%. After applying our 3D Copy-Paste method, which covers physically plausible insertion position, pose, size, and lighting, the mAP increases to 43.34%, a gain of 1.21 points. These results further support the robustness and effectiveness of our proposed method.
Setting | Insertion Position, Pose, Size | Insertion Illumination | mAP |
---|---|---|---|
Im3D | N/A | N/A | 42.13 |
Im3D + 3D Copy-Paste | Plausible position, size, pose | Plausible dynamic light | 43.34 |
Appendix B More experiment details
We run the same experiments multiple times with different random seeds. Table 2 reports the corresponding main-paper results with their error ranges.
Setting | Insertion Position, Pose, Size | Insertion Illumination | mAP@0.25 |
---|---|---|---|
ImVoxelNet | N/A | N/A | 40.96 ± 0.4 |
ImVoxelNet + random insert | Random | Camera point light | 37.02 ± 0.4 |
ImVoxelNet + 3D Copy-Paste (w/o light) | Plausible position, size, pose | Camera point light | 41.80 ± 0.3 |
ImVoxelNet + 3D Copy-Paste | Plausible position, size, pose | Plausible dynamic light | 43.79 ± 0.4 |
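As an illustration of how a "mean ± error" entry of this kind can be produced from repeated runs, the following minimal sketch aggregates per-seed scores. It assumes the error range is a sample standard deviation, which the text does not specify. The function name `summarize_runs` and the per-seed values are hypothetical placeholders and are not taken from our experiments.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of per-seed mAP scores."""
    mean = statistics.mean(scores)
    # Assumption: the reported error range is a sample standard deviation
    # (n - 1 denominator); the source does not state which statistic is used.
    spread = statistics.stdev(scores)
    return mean, spread


# Hypothetical per-seed scores for illustration only, NOT results from the paper.
example_scores = [40.6, 41.0, 41.4]
mean, spread = summarize_runs(example_scores)
print(f"{mean:.2f} ± {spread:.1f}")
```

Other choices, such as the standard error of the mean or the half-range across seeds, would give a different error value for the same runs.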
We also report results with mAP@0.15 on the SUN RGB-D dataset (Table 3); our method shows a consistent improvement.
Setting | Insertion Position, Pose, Size | Insertion Illumination | mAP@0.15 |
---|---|---|---|
ImVoxelNet | N/A | N/A | 48.45 |
ImVoxelNet + 3D Copy-Paste | Plausible position, size, pose | Plausible dynamic light | 51.16 |
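The two metrics differ only in the 3D IoU threshold at which a predicted box counts as a match to a ground-truth box. The sketch below illustrates that threshold test under a simplifying assumption: boxes are axis-aligned and given as (min corner, max corner) pairs, whereas actual SUN RGB-D evaluation uses oriented boxes. All function names are hypothetical, and the full mAP computation (confidence ranking, one-to-one matching, precision-recall integration) is omitted.

```python
def box_volume(box):
    """Volume of an axis-aligned box given as ((x0, y0, z0), (x1, y1, z1))."""
    (x0, y0, z0), (x1, y1, z1) = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0) * max(0.0, z1 - z0)


def iou_3d_axis_aligned(a, b):
    """3D intersection-over-union of two axis-aligned boxes (simplification)."""
    (a_min, a_max), (b_min, b_max) = a, b
    inter = 1.0
    for lo_a, hi_a, lo_b, hi_b in zip(a_min, a_max, b_min, b_max):
        overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
        if overlap <= 0:
            return 0.0  # no overlap along this axis, so no intersection
        inter *= overlap
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, gt, threshold):
    """A prediction matches a ground-truth box if their IoU reaches the threshold."""
    return iou_3d_axis_aligned(pred, gt) >= threshold


# Illustrative unit cube and a prediction shifted by 0.7 along x.
gt = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
pred = ((0.7, 0.0, 0.0), (1.7, 1.0, 1.0))
print(is_match(pred, gt, 0.15), is_match(pred, gt, 0.25))
```

In this example the IoU is roughly 0.18, so the same prediction counts as a match at the 0.15 threshold but not at 0.25, which is why mAP@0.15 values are higher than mAP@0.25 values for the same model.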
Appendix C Discussion on Limitations and Broader Impact
Limitations. Our method, while effective, has certain limitations. A key constraint is its reliance on the availability of external 3D objects, particularly for uncommon categories where sufficient 3D assets may not be readily available. This limitation could affect the performance of downstream tasks. Moreover, the quality of the inserted objects can also affect the results. One possible strategy to address this limitation is to use techniques such as Neural Radiance Fields (NeRF) to construct higher-quality 3D assets for different categories.
Broader Impact. Our proposed 3D Copy-Paste method demonstrates that physically plausible 3D object insertion can serve as an effective generative data augmentation technique, leading to state-of-the-art performance in discriminative downstream tasks such as monocular 3D object detection. This work has implications for both the computer graphics and computer vision communities. From a graphics perspective, our method demonstrates that more accurate 3D property estimation, reconstruction, and inverse rendering techniques can generate more plausible 3D assets and enable better scene understanding. These assets are not only visually compelling but can also contribute effectively to downstream computer vision tasks. From a computer vision perspective, it encourages us to use synthetic data more effectively to tackle challenges in downstream fields, including computer vision and robotics.