3D Feature Distillation with Object-Centric Priors
Abstract
Grounding natural language to the physical world is a ubiquitous topic with a wide range of applications in computer vision and robotics. Recently, 2D vision-language models such as CLIP have been widely popularized, due to their impressive capabilities for open-vocabulary grounding in 2D images. Recent works aim to elevate 2D CLIP features to 3D via feature distillation, but either learn neural fields that are scene-specific and hence lack generalization, or focus on indoor room scan data that require access to multiple camera views, which is not practical in robot manipulation scenarios. Additionally, related methods typically fuse features at pixel-level and assume that all camera views are equally informative. In this work, we show that this approach leads to sub-optimal 3D features, both in terms of grounding accuracy, as well as segmentation crispness. To alleviate this, we propose a multi-view feature fusion strategy that employs object-centric priors to eliminate uninformative views based on semantic information, and fuse features at object-level via instance segmentation masks. To distill our object-centric 3D features, we generate a large-scale synthetic multi-view dataset of cluttered tabletop scenes, spawning 15k scenes from over 3300 unique object instances, which we make publicly available. We show that our method reconstructs 3D CLIP features with improved grounding capacity and spatial consistency, while doing so from single-view RGB-D, thus departing from the assumption of multiple camera views at test time. Finally, we show that our approach can generalize to novel tabletop domains and be re-purposed for 3D instance segmentation without fine-tuning, and demonstrate its utility for language-guided robotic grasping in clutter.
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x1.png)
Keywords: Open-Vocabulary 3D Segmentation, Multi-view Feature Distillation
1 Introduction
Language grounding in 3D environments plays a crucial role in realizing intelligent systems that can interact naturally with the physical world. In the robotics field, being able to precisely segment desired objects in 3D based on open language queries (object semantics, visual attributes, affordances, etc.) can serve as a powerful proxy for enabling open-ended robot manipulation. As a result, research focus on 3D segmentation methods has seen growth in recent years [1, 2, 3, 4, 5, 6]. However, related methods fall in the closed-vocabulary regime, where only a fixed list of classes can be used as queries. Inspired by the success of open-vocabulary 2D methods [7, 8, 9, 10], recent efforts elevate 2D representations from pretrained models [7, 11] to 3D via distillation pipelines [12, 13, 14, 15, 16, 17, 18, 19]. However, we identify certain limitations of existing distillation approaches. On the one hand, field-based methods [13, 20, 16, 17, 18] offer continuous 3D feature fields, but must be trained online in each specific scene and hence cannot generalize to novel object instances and compositions. They also require a few minutes to train and need multiple camera views to be collected before training, all of which hinder their real-time applicability. On the other hand, the original 3D feature distillation methods and follow-up work [12, 14, 21] use room scan datasets [22, 23] to learn point-cloud encoders, and are hence applicable in novel scenes with open vocabularies. However, such approaches assume that 2D features from all views are equally informative, which is not the case in highly cluttered indoor scenes (e.g. due to partial occlusions from some views), thus leading to noisy 3D features. 2D features are also usually fused point-wise from ViT patches [9, 10, 8] or multi-scale crops [13, 6], leading to the so-called “patchyness” issue [24] (see Fig. 1). 
The latter issue is especially impactful in robot manipulation, where precise 3D segmentation is vital for specifying robust actuation goals.
To address such limitations, we revisit 2D → 3D point-based feature distillation but revise the multi-view feature fusion strategy to enhance the quality of the target 3D features. In particular, we inject both semantic and spatial object-centric priors into the fusion strategy, in three ways: (i) we obtain object-level 2D features by isolating object instances in each camera view from their 2D segmentation masks, (ii) we fuse features only at corresponding 3D object regions using 3D segmentation masks, (iii) we leverage object-level semantic information to devise an informativeness metric, which is used to weight the contribution of views and eliminate uninformative ones. Extensive ablation studies demonstrate the advantages of object-centric fusion compared to vanilla approaches. To train our method, we require a large-scale cluttered indoor dataset with many views per scene, which does not currently exist. To that end, we build MV-TOD (Multi-View Tabletop Objects Dataset), consisting of 15k Blender scenes built from more than 3300 unique 3D object models, for which we provide dense multi-view coverage per scene, further equipped with 2D/3D segmentations, 6-DoF grasps and textual object-level annotations. We use MV-TOD to distill our object-centric 3D CLIP [7] features into a 3D representation, which we call DROP-CLIP (Distilled Representations with Object-centric Priors from CLIP). Our 3D encoder operates on partial point-clouds from a single RGB-D view, thus departing from the requirement of multiple camera images at test time, while offering real-time inference capabilities. We demonstrate that our learned 3D features achieve high grounding performance and segmentation crispness, while significantly outperforming previous 2D open-vocabulary approaches in the single-view setting. Further, we show that they can be leveraged zero-shot in novel tabletop domains, as well as be used out-of-the-box for 3D instance segmentation.
In summary, our contributions are fourfold: (i) we release MV-TOD, a large-scale synthetic dataset of household objects in cluttered tabletop scenarios, featuring dense multi-view coverage and semantic/mask/grasp annotations, (ii) we identify limitations of current multi-view feature fusion approaches and illustrate how to overcome them by leveraging object-centric priors, (iii) we release DROP-CLIP, a 3D model that reconstructs view-independent 3D CLIP features from single-view, and (iv) we conduct extensive ablation studies, comparative experiments and robot demonstrations to showcase the effectiveness of the proposed method in terms of 3D segmentation performance, generalization to novel domains and tasks, and applicability in robot manipulation scenarios.
2 Multi-View Tabletop Objects Dataset
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x2.png)
Dataset | Layout | Multi-View | Clutter | Vision Data | Ref.Expr. Annot. | Grasp Annot. | Num.Obj. Categories | Num. Scenes | Num. Expr. | Obj.-lvl Semantics |
---|---|---|---|---|---|---|---|---|---|---|
ScanNet [22] | indoor | ✔ | - | RGB-D,3D | ✗ | ✗ | | | | ✗ |
S3DIS [25] | indoor | ✔ | - | RGB-D,3D | ✗ | ✗ | | | | ✗ |
Replica [26] | indoor | ✔ | - | RGB-D,3D | ✗ | ✗ | | | | ✔ |
STPLS3D [25] | outdoor | ✔ | - | 3D | ✗ | ✗ | | | | ✔ |
ScanRefer [1] | indoor | ✔ | ✗ | RGB-D,3D | 2D/3D mask | ✗ | | | | ✗ |
ReferIt-3D [2] | indoor | ✔ | ✗ | RGB-D,3D | 2D/3D mask | ✗ | | | | ✗ |
ReferIt-RGBD [27] | indoor | ✔ | ✗ | RGB-D | 2D box | ✗ | - | | | ✗ |
SunSpot [28] | indoor | ✗ | ✔ | RGB-D | 2D box | ✗ | 38 | | | ✗ |
GraspNet [29] | tabletop | ✗ | ✔ | 3D | ✗ | 6-DoF | | | | ✗ |
REGRAD [30] | tabletop | ✔ | ✔ | RGB-D,3D | ✗ | 6-DoF | | | | ✗ |
OCID-VLG [31] | tabletop | ✗ | ✔ | RGB-D,3D | 2D mask | 4-DoF | | | | template |
Grasp-Anything [32] | tabletop | ✗ | ✗ | RGB | 2D mask | 4-DoF | | | | open |
MV-TOD (ours) | tabletop | ✔ | ✔ | RGB-D,3D | 3D mask | 6-DoF | | | | open |
Existing 3D datasets mainly focus on indoor scenes in room layouts [33, 22, 26] and related language annotations typically cover closed-set object categories (e.g. furniture) and spatial relations [1, 2, 27, 34, 28], which are not practical for robot manipulation tasks, where cluttered tabletop scenarios and open-vocabulary language are of key importance. On the other hand, recent grasp-related efforts collect cluttered tabletop scenes, but either lack language annotations [30, 35, 29] or connect cluttered scenes with language but only for 4-DoF grasps with RGB data [31, 32]. Further, most of such datasets lack dense multi-view scene coverage, rendering them inapplicable for 2D → 3D feature distillation, where we require multiple images from each scene to extract 2D features with a foundation model. To cover this gap, we propose MV-TOD, a large-scale synthetic dataset with cluttered tabletop scenes featuring dense multi-view coverage and rich language annotations at the object level. We generate a total of 15k scenes in Blender [36], comprising 3379 unique object models, some collected by us and the rest filtered from the ShapeNet-Sem model set [37]. Each object category in the dataset includes multiple instances that vary in fine-grained details. For each object instance, we leverage GPT-4-Vision [38] to generate open-set descriptions from various perspectives, including category, color, material, state, utility, affordance, etc., which spawn the dataset's unique referring instance queries (see Fig. 2-right and Appendix A). For each scene, we provide 2D/3D segmentation masks, 6D object poses, as well as a set of semantic concepts for each appearing object instance. Additionally, we include 6-DoF grasp annotations for each object model, originating from the ACRONYM dataset [35]. 
To the best of our knowledge, MV-TOD is the first dataset to combine 3D cluttered tabletop scenes with open-vocabulary language and 6-DoF grasp annotations, which we hope will accelerate future research.
3 Methodology
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x3.png)
Our goal is to distill multi-view 2D CLIP features into a 3D representation, while employing an object-centric feature fusion strategy to ensure high quality 3D features. Our overall pipeline is illustrated in Fig. 3. We first introduce traditional multi-view feature fusion (Sec. 3.1), then present our variant with object-centric priors (Sec. 3.2), and finally discuss our feature distillation method (Sec. 3.3).
3.1 Multi-view 2D Feature Fusion
We assume access to a dataset of 3D scenes, where each scene is represented through a set of $V$ posed RGB-D views of size $H \times W$: $\{(I_v, D_v, T_v)\}_{v=1}^{V}$, where $I_v \in \mathbb{R}^{H \times W \times 3}$ is the RGB image, $D_v \in \mathbb{R}^{H \times W}$ the depth map and $T_v$ the camera pose from view $v$. For each scene, we first obtain the full point-cloud $P \in \mathbb{R}^{N \times 3}$ along with a 2D-3D correspondence map $\mathcal{M}$, mapping each point $i$ to a pixel location $\mathcal{M}(i, v)$ in image view $v$. RGB images are fed to a pretrained image model $f^{2D}$ [9, 10, 8] to obtain pixel-level 2D features of size $C$: $Z^{2D}_v = f^{2D}(I_v) \in \mathbb{R}^{H \times W \times C}$, which are then back-projected to 3D points via:
$$ z^{2D}_{i,v} = Z^{2D}_v\left[\mathcal{M}(i, v)\right] \tag{1} $$
To fuse 2D features across views, previous works [12, 6, 14, 21] use average pooling: $z^{3D}_i = \frac{1}{|\mathcal{V}_i|} \sum_{v \in \mathcal{V}_i} z^{2D}_{i,v}$, where $\mathcal{V}_i$ denotes the set of views from which point $i$ is visible (see Appendix B for a comprehensive overview). In essence, this method assumes that all views are equally informative for each point, as long as the point is visible from that view. We suggest that naively average pooling 2D features for each point leads to sub-optimal 3D features, as noisy, uninformative views contribute equally, therefore “polluting” the overall representation. We propose to instead use a generalized version relying on a weighted average:
$$ z^{3D}_i = \frac{\sum_{v \in \mathcal{V}_i} \omega_{i,v} \, z^{2D}_{i,v}}{\sum_{v \in \mathcal{V}_i} \omega_{i,v}} \tag{2} $$
where $\omega_{i,v}$ is a scalar weight that represents the informativeness of view $v$ for point $i$. In the next subsection, we describe how to use text data to dynamically compute an informativeness weight for each view based on object-level semantic information. Additionally, vanilla point-wise fusion with pixel-level 2D features leads to non-crisp segmentations and fuzzy object boundaries. To resolve this, we propose to also leverage dense spatial information, i.e., instance-wise 2D/3D segmentation masks, which are used for both: (a) obtaining robust object-level 2D CLIP features from each view, and (b) fusing features only at the points corresponding to the 3D object region.
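The difference between average pooling and the weighted fusion of equation (2) can be illustrated with a minimal Python sketch. It is not the released implementation: plain lists stand in for feature tensors, and the function name `fuse_point_features` and its dictionary-based view layout are assumptions made only for illustration.

```python
from typing import Dict, List, Sequence

Feature = List[float]


def fuse_point_features(
    views: Sequence[Dict[int, Feature]],
    weights: Sequence[Dict[int, float]],
    num_points: int,
    dim: int,
) -> List[Feature]:
    """Weighted multi-view fusion of back-projected 2D features.

    views[v] maps a point index to the 2D feature it receives from view v;
    a point that is not visible from view v is simply absent from views[v].
    weights[v] holds the matching per-point weights (hypothetical layout).
    """
    fused = [[0.0] * dim for _ in range(num_points)]
    for i in range(num_points):
        total = 0.0
        for feats, w in zip(views, weights):
            if i not in feats:
                continue  # point i is not visible from this view
            total += w[i]
            for c in range(dim):
                fused[i][c] += w[i] * feats[i][c]
        if total > 0.0:
            fused[i] = [x / total for x in fused[i]]
    return fused


# Two views: point 0 is seen by both, point 1 only by the first one.
views = [{0: [1.0, 0.0], 1: [0.0, 1.0]}, {0: [0.0, 1.0]}]

# Uniform weights reduce the weighted average to plain average pooling.
uniform = [{i: 1.0 for i in view} for view in views]
average = fuse_point_features(views, uniform, num_points=2, dim=2)

# A zero weight removes the second view's contribution for point 0.
selective = fuse_point_features(views, [{0: 1.0, 1: 1.0}, {0: 0.0}], 2, 2)
```

With uniform weights every visible view contributes equally, which is exactly the behaviour the weighted variant is meant to relax.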
3.2 Employing Object-Centric Priors
Let $S^{2D} = \{S^{2D}_v\}_{v=1}^{V}$ be the 2D instance-wise segmentation masks for each scene, and $K$ the total number of scene objects. We aggregate the 2D masks to obtain a 3D mask $S^{3D}$, such that for each point $i$ we can retrieve the corresponding object instance $k_i = S^{3D}_i \in \{1, \dots, K\}$.
Semantic informativeness metric Let $Q = \{q_{k,j}\}$, with $k = 1, \dots, K_{\mathcal{D}}$ and $j = 1, \dots, J_k$, be a set of object-specific textual prompts, where $K_{\mathcal{D}}$ is the number of dataset object instances and $J_k$ the number of prompts for object $k$. We use CLIP’s text encoder to embed the textual prompts in $\mathbb{R}^{C}$ and average them to obtain an object-specific prompt $q_k \in \mathbb{R}^{C}$. For each scene, we map each object instance $k$ to its positive prompt $q^{+}_{k}$, as well as a set of negative prompts $Q^{-}_{k}$ corresponding to all other instances. Denoting by $z^{2D}_{i,v}$ the 2D feature back-projected to point $i$ from view $v$, we define our semantic informativeness metric as:
$$ G_{i,v} = \cos\left(z^{2D}_{i,v}, q^{+}_{k_i}\right) - \max_{q \in Q^{-}_{k_i}} \cos\left(z^{2D}_{i,v}, q\right) \tag{3} $$
Intuitively, we want a 2D feature from view $v$ to contribute to the overall 3D feature of point $i$ according to how much its similarity with the correct object instance is higher than the maximum similarity to any of the negative object instances, hence offering a proxy for semantic informativeness. We clip this weight to 0 to eliminate views that don’t satisfy the condition $\cos(z^{2D}_{i,v}, q^{+}_{k_i}) > \max_{q \in Q^{-}_{k_i}} \cos(z^{2D}_{i,v}, q)$. Plugging our metric into equation (2) already provides improvements over vanilla average pooling (see Sec. 4.1); however, it does not deal with 3D spatial consistency, for which we employ our spatial priors below.
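As a minimal sketch of the clipped margin described for equation (3), assuming the visual feature and the averaged prompt embeddings are already available as plain vectors (the helper names `cosine` and `informativeness` are hypothetical, and at least one negative prompt is assumed):

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def informativeness(
    feature: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
) -> float:
    """Similarity margin of the positive prompt over the hardest negative.

    The margin is clipped at 0 so that a view whose feature is closer to a
    wrong object than to the correct one receives no weight at all.
    """
    hardest_negative = max(cosine(feature, q) for q in negatives)
    return max(cosine(feature, positive) - hardest_negative, 0.0)


positive = [1.0, 0.0]
negatives = [[0.0, 1.0]]
clean_view = informativeness([1.0, 0.0], positive, negatives)     # full margin
confused_view = informativeness([0.0, 1.0], positive, negatives)  # clipped
```

A view that matches a negative prompt better than the positive one is discarded entirely rather than merely down-weighted, which is what separates this scheme from a softmax-style weighting.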
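Object-level 2D CLIP features For obtaining object-level 2D CLIP features, we isolate the pixels of each object $k$ in each view $v$ using the 2D mask $S^{2D}_v$ and crop a bounding box around the mask from the image $I_v$: $\hat{z}^{2D}_{v,k} = f^{2D}_{cls}\left(\mathrm{crop}(I_v, S^{2D}_v = k)\right)$ (see Appendix C for ablations in CLIP visual prompts). Here we use $f^{2D}_{cls}$, i.e., only the [CLS] feature of CLIP’s ViT encoder, to represent each object crop. We can now define our metric from equation (3) also at object-level:
$$ G_{v,k} = \cos\left(\hat{z}^{2D}_{v,k}, q^{+}_{k}\right) - \max_{q \in Q^{-}_{k}} \cos\left(\hat{z}^{2D}_{v,k}, q\right) \tag{4} $$
where $G_{v,k}$ now represents the semantic informativeness of view $v$ for object instance $k$, computed against the positive prompt $q^{+}_{k}$ and the negative prompts $Q^{-}_{k}$ of that object.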
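Fusing object-wise features A 3D object-level feature $\hat{z}^{3D}_{k}$ can be obtained by fusing the 2D object-level features $\hat{z}^{2D}_{v,k}$ across views similar to equation (2):
$$ \hat{z}^{3D}_{k} = \frac{\sum_{v=1}^{V} G_{v,k} \, \Lambda_{v,k} \, \hat{z}^{2D}_{v,k}}{\sum_{v=1}^{V} G_{v,k} \, \Lambda_{v,k}} \tag{5} $$
where each view is weighted by its semantic informativeness metric $G_{v,k}$, as well as optionally a visibility metric $\Lambda_{v,k}$ that measures the number of pixels from the $k$-th object’s mask that are visible from view $v$ [6]. We finally reconstruct the full feature-cloud $Z^{3D} \in \mathbb{R}^{N \times C}$ by equating each point’s feature to its corresponding 3D object-level one via: $z^{3D}_i = \hat{z}^{3D}_{k_i}$, where $k_i$ is the object instance that point $i$ belongs to.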
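The object-wise fusion of equation (5) and the final broadcast to points can be sketched as follows. This is an illustration under assumptions rather than the authors' code: features are plain lists, the per-view dictionaries keyed by object id are a hypothetical layout, and the names `fuse_object_features` and `feature_cloud` are made up for the example.

```python
from typing import Dict, List, Sequence

Feature = List[float]


def fuse_object_features(
    view_feats: Sequence[Dict[int, Feature]],
    sem_weights: Sequence[Dict[int, float]],
    vis_weights: Sequence[Dict[int, float]],
) -> Dict[int, Feature]:
    """Fuse per-view object features into one feature per object.

    view_feats[v][k] is the feature of object k in view v. Each view is
    weighted by informativeness times visibility (visible pixel count).
    """
    sums: Dict[int, Feature] = {}
    totals: Dict[int, float] = {}
    for feats, sem, vis in zip(view_feats, sem_weights, vis_weights):
        for k, z in feats.items():
            w = sem[k] * vis[k]
            acc = sums.setdefault(k, [0.0] * len(z))
            for c, value in enumerate(z):
                acc[c] += w * value
            totals[k] = totals.get(k, 0.0) + w
    # Objects whose views all received zero weight are left out.
    return {
        k: [x / totals[k] for x in acc]
        for k, acc in sums.items()
        if totals[k] > 0.0
    }


def feature_cloud(
    point_to_object: Sequence[int], object_feats: Dict[int, Feature]
) -> List[Feature]:
    """Give every point the fused feature of the object it belongs to."""
    return [list(object_feats[k]) for k in point_to_object]


view_feats = [{1: [1.0, 0.0], 2: [0.0, 2.0]}, {1: [0.0, 1.0], 2: [0.0, 4.0]}]
sem = [{1: 0.5, 2: 1.0}, {1: 0.0, 2: 1.0}]  # view 2 is uninformative for object 1
vis = [{1: 100, 2: 10}, {1: 400, 2: 30}]    # visible mask pixels per view
fused = fuse_object_features(view_feats, sem, vis)
cloud = feature_cloud([1, 1, 2], fused)
```

Because every point of an object receives the same vector, feature boundaries coincide with the mask boundaries by construction, which is the source of the crisp segmentations discussed above.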
3.3 View-Independent Feature Distillation
Even though the above feature-cloud could be directly used for open-vocabulary grounding in 3D, its construction is computationally intensive and requires a lot of expensive resources, such as access to multiple camera views, view-aligned 2D instance segmentation masks, as well as a set of text descriptions to compute informativeness metrics. Such utilities are rarely available in open-ended scenarios, especially in robotic applications, where usually only single-view RGB-D images from sensors mounted on the robot are provided. To tackle this, we wish to distill all the above knowledge from the feature-cloud into a 3D encoder that receives only a partial point-cloud from single-view posed RGB-D. Hence, the only assumption that we make during inference is access to camera intrinsic and extrinsic parameters, which is a mild requirement in most robotic works.
In particular, given a partial colored point-cloud from view $v$: $P_v \in \mathbb{R}^{N_v \times 6}$ (3D coordinates plus colors), we train a 3D encoder $\mathcal{E}_\theta: \mathbb{R}^{N_v \times 6} \rightarrow \mathbb{R}^{N_v \times C}$ such that $\mathcal{E}_\theta(P_v) = Z^{3D}$, where $Z^{3D}$ denotes the fused feature-cloud of Sec. 3.2 at the corresponding points. Notice that the distillation target $Z^{3D}$ is independent of view $v$. Following [12, 15] we use cosine distance loss:
$$ \mathcal{L}(\theta) = 1 - \cos\left(\mathcal{E}_\theta(P_v), Z^{3D}\right) \tag{6} $$
See Appendix B.2 for training implementation details. With such a setup, we can obtain 3D features that: (i) are co-embedded in CLIP text space, so they can be leveraged for 3D segmentation tasks from open-vocabulary queries via computing cosine similarities between a CLIP text embedding $q$ and the predicted feature cloud: $\cos\left(q, \mathcal{E}_\theta(P_v)\right)$, (ii) are ensured to be optimally informative per object, due to the usage of the semantic informativeness metric to compute $Z^{3D}$, (iii) maintain 3D spatial consistency in object boundaries, due to performing object-wise instead of point-wise fusion when computing $Z^{3D}$, and (iv) are encouraged to be view-independent, as the same features $Z^{3D}$ are utilized as distillation targets regardless of the input view $v$. Importantly, no labels, prompts, or segmentation masks are needed at test-time to reproduce the fused feature-cloud, while obtaining it amounts to a single forward pass of our 3D encoder, hence offering real-time performance.
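The two uses of cosine similarity in this section, as a distillation loss and as an open-vocabulary query mechanism, can be sketched in a few lines of Python. The sketch assumes that the loss is averaged over points and that a fixed similarity threshold turns scores into a mask; both choices, as well as the names `cosine_distance_loss` and `segment_by_text`, are assumptions for illustration and are not taken from the paper.

```python
import math
from typing import List, Sequence

Feature = Sequence[float]


def cosine(a: Feature, b: Feature) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def cosine_distance_loss(pred: Sequence[Feature], target: Sequence[Feature]) -> float:
    """Mean over points of 1 - cos(predicted feature, target feature)."""
    return sum(1.0 - cosine(p, t) for p, t in zip(pred, target)) / len(pred)


def segment_by_text(
    pred: Sequence[Feature], text_embedding: Feature, threshold: float = 0.5
) -> List[bool]:
    """Select the points whose feature is similar enough to the text query.

    The threshold value is a placeholder, not a value from the paper.
    """
    return [cosine(p, text_embedding) > threshold for p in pred]


predicted = [[1.0, 0.0], [0.0, 1.0]]
perfect = cosine_distance_loss(predicted, [[2.0, 0.0], [0.0, 3.0]])
half_wrong = cosine_distance_loss(predicted, [[1.0, 0.0], [1.0, 0.0]])
mask = segment_by_text(predicted, [1.0, 0.0])
```

Since the loss depends only on direction, targets of different magnitude but equal orientation incur no penalty, which matches the way the features are later queried.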
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x4.png)
4 Experiments
In our experiments, we explore the following questions: (i) Sec. 4.1: What are the contributions of our proposed object-centric priors for multi-view feature fusion? Does the dense number of views of our proposed dataset also contribute? (ii) Sec. 4.2: How does our method compare to previous open-vocabulary approaches for 3D semantic and referring segmentation tasks? Are the learned features robust to open-ended language? (iii) Sec. 4.3: What are the generalization capabilities of our learned 3D representation in novel domains and novel tasks (3D instance segmentation)? (iv) Sec. 4.4: Can we leverage our 3D learned representation for language-guided 6-DoF robotic grasping?
4.1 Multi-view Feature Fusion Ablation Studies
Fusion | 2D Features | Visibility | Sem. Inform. | Ref.Segm. mIoU (%) | Ref.Segm. Pr@25 (%) | Ref.Segm. Pr@50 (%) | Ref.Segm. Pr@75 (%) |
---|---|---|---|---|---|---|---|
point | patch | | | 44.2 | 59.9 | 41.4 | 27.0 |
point | patch | ✓ | | 37.3 | 55.4 | 33.7 | 16.7 |
point | patch | | ✓ | 57.0 | 74.1 | 59.5 | 40.9 |
point | patch | ✓ | ✓ | 57.4 | 77.0 | 60.9 | 39.9 |
obj | obj | | | 65.6 | 67.0 | 65.4 | 64.1 |
obj | obj | ✓ | | 67.3 | 68.7 | 67.1 | 65.8 |
obj | obj | | ✓ | 83.1 | 83.9 | 83.1 | 82.4 |
obj | obj | ✓ | ✓ | 80.9 | 83.1 | 80.2 | 79.7 |
To evaluate the contributions of our proposed object-centric priors, we conduct ablation studies on the multi-view feature fusion pipeline, where we compare 3D referring segmentation results of obtained 3D features in held-out scenes of MV-TOD. We highlight that here we aim to establish a performance upper bound that the feature fusion method can provide for distillation, and not the distilled features themselves.
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x5.png)
We ablate: (i) patch-wise vs. object-wise fusion, (ii) MaskCLIP [8] patch-level vs. CLIP [7] masked crop features, (iii) inclusion of the visibility and semantic informativeness metrics for view selection. Results are reported in Table 2.
Effect of object-centric priors We observe that both proposed priors contribute positively to the quality of the 3D features. Our proposed semantic informativeness metric boosts mIoU across both point- and object-wise fusion (57.0 vs. 44.2 and 83.1 vs. 65.6 respectively). Further, we observe that the usage of spatial priors for object-wise fusion and object-level features leads to both higher segmentation crispness (83.1 vs. 57.0 mIoU when both setups use the informativeness metric alone), as well as higher grounding precision (82.4 vs. 40.9 Pr@75 in the same setting). See qualitative comparisons in Appendix D.
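For reference, the reported metrics can be computed along the following lines. The sketch assumes the common referring-segmentation convention that Pr@X is the percentage of queries whose IoU exceeds X%; whether the comparison is strict is an assumption, and the function names are hypothetical.

```python
from typing import Dict, Sequence


def iou(pred: Sequence[bool], gt: Sequence[bool]) -> float:
    """Intersection over union of two per-point binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 1.0


def referring_metrics(
    ious: Sequence[float], thresholds: Sequence[int] = (25, 50, 75)
) -> Dict[str, float]:
    """mIoU and Pr@X (both in percent) over a set of per-query IoU values."""
    n = len(ious)
    metrics = {"mIoU": 100.0 * sum(ious) / n}
    for t in thresholds:
        hits = sum(1 for x in ious if 100.0 * x > t)
        metrics[f"Pr@{t}"] = 100.0 * hits / n
    return metrics


single = iou([True, True, False, False], [True, False, True, False])
scores = referring_metrics([0.2, 0.6, 0.9, 0.5])
```

A large gap between Pr@25 and Pr@75 indicates masks that find the right object but bleed over its boundary, which is how the table separates grounding accuracy from segmentation crispness.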
Effect of the number of views We ablate the 3D referring segmentation performance based on the number of input views in Fig. 5, where novel viewpoints are added incrementally. We observe that in both setups (point- and object-wise) fusing features from more views leads to improvements, with a small plateauing behavior around 40 views. We believe this is an encouraging result for leveraging dense multi-view coverage in feature distillation pipelines, as we propose with the introduction of MV-TOD.
4.2 Open-Vocabulary 3D Segmentation Results
Method | #views | Ref.Segm. mIoU (%) | Ref.Segm. Pr@25 (%) | Ref.Segm. Pr@50 (%) | Ref.Segm. Pr@75 (%) | Sem.Segm. mIoU (%) | Sem.Segm. mAcc (%) |
---|---|---|---|---|---|---|---|
OpenScene† [12] | 73 | 29.32 | 44.00 | 24.51 | 11.26 | 21.79 | 32.14 |
OpenMask3D∗† [6] | 73 | 65.38 | 73.05 | 63.99 | 57.40 | 59.47 | 66.48 |
DROP-CLIP† (Ours) | 73 | 82.67 | 86.11 | 82.43 | 79.23 | 75.41 | 80.02 |
DROP-CLIP (Ours) | 73 | 66.56 | 75.73 | 67.55 | 59.88 | 62.04 | 70.74 |
OpenSeg→3D [9] | 1 | 12.89 | 17.36 | 2.38 | 0.23 | 12.83 | 17.21 |
MaskCLIP→3D [8] | 1 | 25.64 | 40.36 | 18.69 | 6.95 | 20.97 | 32.09 |
DROP-CLIP (Ours) | 1 | 62.31 | 71.96 | 62.75 | 53.85 | 54.48 | 64.41 |
In this section, we compare referring and semantic segmentation performance of our distilled features vs. previous open-vocabulary approaches, both in multi-view and in single-view settings. For multi-view, we compare our trained model with OpenScene [12] and OpenMask3D [6] methods, where the full point-cloud from all views is given as input.
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x6.png)
We note that for these baselines we obtain the upper-bound 3D features as before: since our trained model already outperforms them, we refrained from also distilling features from the baselines (details in Appendix C2). For single-view, we feed our network with a partial point-cloud projected from a single RGB-D pair, and compare with the 2D baselines MaskCLIP [8] and OpenSeg [9]. Our model slightly outperforms the OpenMask3D upper bound baseline in the multi-view setting (66.56 vs. 65.38 mIoU in referring and 62.04 vs. 59.47 mIoU in semantic segmentation), while significantly outperforming 2D baselines in the single-view setting (by more than 30 mIoU points in both tasks). Importantly, single-view results closely match the multi-view ones (62.31 vs. 66.56 mIoU in referring segmentation), suggesting that DROP-CLIP indeed learns view-independent features.
Open-ended queries We evaluate the robustness of our model to different types of input language queries, organized in 4 families (class name, e.g. “cereal box”; class + attribute, e.g. “brown cereal box”; open, e.g. “chocolate Kellogs”; and affordance, e.g. “I want something sweet”). Comparative results are presented in Fig. 6 and qualitative ones in Fig. 4. We observe that our method achieves high grounding accuracy in all query types, even when using single-view.
4.3 Zero-Shot Transfer to Novel Domains / Tasks
Method | OCID-VLG [31] IoU | OCID-VLG [31] Pr@25 | REGRAD [30] IoU | REGRAD [30] Pr@25 |
---|---|---|---|---|
MaskCLIP→3D [8] | 24.1 | 30.9 | 33.2 | 39.0 |
DROP-CLIP (Ours) | 46.2 | 48.9 | 59.1 | 63.0 |
Generalization to Novel Domains We evaluate the 3D referring segmentation performance of our trained model when applied zero-shot in novel tabletop domains. We test in 500 scenes from OCID-VLG [31] using the dataset’s instance-wise open queries, as well as in 1000 scenes from REGRAD [30], using each model’s class name as a query. Only single-view input is provided for both datasets.
Method | |||
---|---|---|---|
SAM [39] | 70.11 | 95.26 | 79.88 |
DROP-CLIP (S) | 80.83 | 91.92 | 86.83 |
Mask3D [41] | 14.41 | 18.65 | 3.41 |
DROP-CLIP (F) | 88.37 | 93.13 | 91.47 |
We compare with MaskCLIP [8] as above and report results in Table 4. We note that test datasets contain both novel object instances (REGRAD) and classes (OCID-VLG). We observe that our method provides a significant performance boost across both domains (46.2 vs. 24.1 IoU in OCID-VLG and 59.1 vs. 33.2 IoU in REGRAD).
Zero-Shot 3D Instance Segmentation Since our method has been distilled from features with object-level priors, we demonstrate that it can be used out-of-the-box for 3D instance segmentation, via clustering the 3D features (see Appendix E for implementation details). We report results in MV-TOD in Table 5, where we compare with SAM [39] with single-view images, as well as Mask3D [41] with full point-clouds (transferred from ScanRefer [1] with room layout). Mask3D struggles to generalize to tabletop domains, whereas our method achieves comparable performance with SAM for segmenting from single-view, even without being explicitly trained for instance segmentation.
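To make the idea of clustering the 3D features concrete, the following Python sketch groups points greedily by cosine similarity. It is only an illustration: the actual procedure is described in Appendix E and may differ, and the greedy scheme, the threshold value and the name `cluster_features` are assumptions introduced here.

```python
import math
from typing import List, Sequence

Feature = Sequence[float]


def cosine(a: Feature, b: Feature) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def cluster_features(features: Sequence[Feature], threshold: float = 0.9) -> List[int]:
    """Assign each point to the most similar cluster, or open a new one.

    Clusters are summarised by the running sum of their member features,
    which has the same direction as their mean.
    """
    centroids: List[List[float]] = []
    labels: List[int] = []
    for z in features:
        best, best_sim = -1, threshold
        for idx, centroid in enumerate(centroids):
            sim = cosine(z, centroid)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best < 0:
            centroids.append(list(z))  # no cluster is similar enough
            best = len(centroids) - 1
        else:
            centroids[best] = [a + b for a, b in zip(centroids[best], z)]
        labels.append(best)
    return labels


point_features = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
instance_ids = cluster_features(point_features)
```

Such a simple grouping is plausible only because the distillation targets are constant within each object, so points of one instance are expected to carry nearly identical features.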
4.4 Open-Vocabulary Language-guided Robotic Grasping
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x7.png)
In this section, we wish to illustrate the applicability of DROP-CLIP in a language-guided robotic grasping scenario. We integrate our method with a 6-DoF grasp detection network [42], to segment and then propose gripper poses for picking a target object indicated verbally. We randomly place 5-12 objects on a tabletop with different levels of clutter, and query the robot to pick the target object and place it in a fixed position. The user instruction is open-vocabulary and can involve open object descriptions, attributes, or affordances. We conducted 50 trials in Gazebo [43] and 10 with a real robot, and observed grounding accuracy of 84% and 80% respectively, and a final success rate of 64% and 60%, where failures were mostly due to grasp proposals that are outside of the robot’s kinematic range or motion planning that leads to a collision with other objects or the table. Our setup and example trials are shown in Fig. 7, while more details and qualitative results are provided in Appendix E. A video of robot demonstrations is provided as supplementary material.
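One simple way to connect the two stages is to keep only grasp proposals that lie close to the segmented target and pick the best-scored one. The sketch below illustrates this idea under assumptions: the proposal format (position, score), the distance threshold and the name `select_grasp` are hypothetical and do not describe the interface of the grasp network [42] or the exact pipeline used in the trials.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
Grasp = Tuple[Point, float]  # (gripper position in metres, confidence score)


def select_grasp(
    target_points: Sequence[Point],
    grasps: Sequence[Grasp],
    max_dist: float = 0.02,
) -> Optional[Grasp]:
    """Return the highest-scoring grasp located near the segmented target.

    A proposal is kept if its position lies within max_dist of at least one
    target point; None is returned when no proposal qualifies.
    """
    best: Optional[Grasp] = None
    for position, score in grasps:
        near = any(math.dist(position, p) <= max_dist for p in target_points)
        if near and (best is None or score > best[1]):
            best = (position, score)
    return best


target = [(0.0, 0.0, 0.0), (0.01, 0.0, 0.0)]
proposals: List[Grasp] = [
    ((0.0, 0.0, 0.01), 0.7),
    ((0.5, 0.5, 0.0), 0.95),   # confident, but on a different object
    ((0.01, 0.0, 0.015), 0.8),
]
chosen = select_grasp(target, proposals)
```

Filtering by the segmentation before ranking by score prevents a confident grasp on a neighbouring object from being executed, although reachability and collision checks would still be needed downstream.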
5 Related work
3D Scene Understanding There’s a long line of works in closed-set 3D scene understanding [44, 45, 46, 47, 48, 49], applied in 3D classification [50, 51], localization [52, 1] and segmentation [53, 23, 22], using two-stage pipelines with instance proposals from point-clouds [54, 55] or RGB-D views [56, 27], or single-stage methods [3] that leverage 3D-language cross attentions. [57] use CLIP embeddings for pretraining a 3D segmentation model, but still cannot be applied open-vocabulary.
Open-Vocabulary Grounding with CLIP Following the impressive results of CLIP [7] for open-set image recognition, followup works transfer CLIP’s powerful representations from image- to pixel-level [40, 58, 59, 60, 61, 62, 63, 9, 10, 8], extending to detection / segmentation, but limited to 2D. For 3D segmentation, the closest work is perhaps OpenMask3D [6] that extracts multi-view CLIP features from instance proposals from Mask3D [41] to compute similarities with open text queries.
3D CLIP Feature Distillation Recent works distill features from 2D foundation models with point-cloud encoders [12, 14, 21] or neural fields [13, 19, 17, 18, 19, 24], with applications in robot manipulation [20, 16] and navigation [64, 65]. However, associated works extract 2D features from OpenSeg [9], LSeg [10], MaskCLIP [8] or multi-scale crops from CLIP [7] and fuse point-wise with average pooling, while our approach leverages semantics-informed view selection and segmentation masks to do object-wise fusion with object-level features (see detailed overview in Appendix F).
6 Conclusion, Limitations and Future Work
We propose DROP-CLIP, a 2D → 3D CLIP feature distillation framework that employs object-centric priors to select views based on semantic informativeness and ensure crisp 3D segmentations, while working with single-view RGB-D. We also release MV-TOD, a large-scale synthetic dataset of multi-view tabletop scenes with dense annotations that can be leveraged for several downstream tasks. We hope our work can benefit the robotics community, both in terms of released resources as well as illustrating and overcoming theoretical limitations of existing 3D feature distillation works.
While our spatial object-centric priors lead to improved segmentation quality, they collapse local features in favor of a global object-level feature, and hence cannot be applied for segmenting object parts. In the future, we plan to add object part annotations in our dataset and fuse with both object- and part-level masks. Second, DROP-CLIP only provides grounding and a two-stage pipeline is needed for grasping, while our dataset already provides rich 6-DoF grasp annotations. A next step would be to also distill them, opting for a joint 3D representation for grounding and grasping.
References
- Chen et al. [2020] D. Z. Chen, A. X. Chang, and M. Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16, pages 202–221. Springer, 2020.
- Achlioptas et al. [2020] P. Achlioptas, A. Abdelreheem, F. Xia, M. Elhoseiny, and L. Guibas. Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes. 16th European Conference on Computer Vision (ECCV), 2020.
- Luo et al. [2022] J. Luo, J. Fu, X. Kong, C. Gao, H. Ren, H. Shen, H. Xia, and S. Liu. 3d-sps: Single-stage 3d visual grounding via referred point progressive selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16454–16463, 2022.
- Huang et al. [2021] P.-H. Huang, H.-H. Lee, H.-T. Chen, and T.-L. Liu. Text-guided graph neural networks for referring 3d instance segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 1610–1618, 2021.
- Qian et al. [2024] Z. Qian, Y. Ma, J. Ji, and X. Sun. X-refseg3d: Enhancing referring 3d instance segmentation via structured cross-modal graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4551–4559, 2024.
- Takmaz et al. [2023] A. Takmaz, E. Fedele, R. W. Sumner, M. Pollefeys, F. Tombari, and F. Engelmann. Openmask3d: Open-vocabulary 3d instance segmentation. ArXiv, abs/2306.13631, 2023. URL https://api.semanticscholar.org/CorpusID:259243888.
- Radford et al. [2021] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. CoRR, abs/2103.00020, 2021. URL https://arxiv.org/abs/2103.00020.
- Dong et al. [2022] X. Dong, Y. Zheng, J. Bao, T. Zhang, D. Chen, H. Yang, M. Zeng, W. Zhang, L. Yuan, D. Chen, F. Wen, and N. Yu. Maskclip: Masked self-distillation advances contrastive language-image pretraining. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10995–11005, 2022. URL https://api.semanticscholar.org/CorpusID:251799827.
- Ghiasi et al. [2021] G. Ghiasi, X. Gu, Y. Cui, and T.-Y. Lin. Scaling open-vocabulary image segmentation with image-level labels. In European Conference on Computer Vision, 2021. URL https://api.semanticscholar.org/CorpusID:250895808.
- Li et al. [2022] B. Li, K. Q. Weinberger, S. J. Belongie, V. Koltun, and R. Ranftl. Language-driven semantic segmentation. ArXiv, abs/2201.03546, 2022. URL https://api.semanticscholar.org/CorpusID:245836975.
- Oquab et al. [2023] M. Oquab, T. Darcet, T. Moutakanni, H. Q. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y. B. Huang, S.-W. Li, I. Misra, M. G. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jégou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski. Dinov2: Learning robust visual features without supervision. ArXiv, abs/2304.07193, 2023. URL https://api.semanticscholar.org/CorpusID:258170077.
- Peng et al. [2022] S. Peng, K. Genova, C. M. Jiang, A. Tagliasacchi, M. Pollefeys, and T. A. Funkhouser. Openscene: 3d scene understanding with open vocabularies. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 815–824, 2022. URL https://api.semanticscholar.org/CorpusID:254044069.
- Kerr et al. [2023] J. Kerr, C. M. Kim, K. Goldberg, A. Kanazawa, and M. Tancik. Lerf: Language embedded radiance fields. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 19672–19682, 2023. URL https://api.semanticscholar.org/CorpusID:257557329.
- Nguyen et al. [2023] P. D. Nguyen, T. Ngo, C. Gan, E. Kalogerakis, A. D. Tran, C. Pham, and K. Nguyen. Open3dis: Open-vocabulary 3d instance segmentation with 2d mask guidance. ArXiv, abs/2312.10671, 2023. URL https://api.semanticscholar.org/CorpusID:266348609.
- Koch et al. [2024] S. Koch, N. Vaskevicius, M. Colosi, P. Hermosilla, and T. Ropinski. Open3dsg: Open-vocabulary 3d scene graphs from point clouds with queryable objects and open-set relationships. ArXiv, abs/2402.12259, 2024. URL https://api.semanticscholar.org/CorpusID:267750890.
- Shen et al. [2023] B. W. Shen, G. Yang, A. Yu, J. R. Wong, L. P. Kaelbling, and P. Isola. Distilled feature fields enable few-shot language-guided manipulation. In Conference on Robot Learning, 2023. URL https://api.semanticscholar.org/CorpusID:260926035.
- Tschernezki et al. [2022] V. Tschernezki, I. Laina, D. Larlus, and A. Vedaldi. Neural feature fusion fields: 3d distillation of self-supervised 2d image representations. 2022 International Conference on 3D Vision (3DV), pages 443–453, 2022. URL https://api.semanticscholar.org/CorpusID:252118532.
- Kobayashi et al. [2022] S. Kobayashi, E. Matsumoto, and V. Sitzmann. Decomposing NeRF for editing via feature field distillation. ArXiv, abs/2205.15585, 2022. URL https://api.semanticscholar.org/CorpusID:249209811.
- Engelmann et al. [2024] F. Engelmann, F. Manhardt, M. Niemeyer, K. Tateno, M. Pollefeys, and F. Tombari. OpenNeRF: Open set 3d neural scene segmentation with pixel-wise features and rendered novel views, 2024.
- Rashid et al. [2023] A. Rashid, S. Sharma, C. M. Kim, J. Kerr, L. Y. Chen, A. Kanazawa, and K. Goldberg. Language embedded radiance fields for zero-shot task-oriented grasping. In Conference on Robot Learning, 2023. URL https://api.semanticscholar.org/CorpusID:261882332.
- Zhang et al. [2023] J. Zhang, R. Dong, and K. Ma. CLIP-FO3D: Learning free open-world 3d scene representations from 2d dense CLIP. 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 2040–2051, 2023. URL https://api.semanticscholar.org/CorpusID:257404908.
- Dai et al. [2017] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. ScanNet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017.
- Ramakrishnan et al. [2021] S. K. Ramakrishnan, A. Gokaslan, E. Wijmans, O. Maksymets, A. Clegg, J. Turner, E. Undersander, W. Galuba, A. Westbury, A. X. Chang, M. Savva, Y. Zhao, and D. Batra. Habitat-Matterport 3D dataset (HM3D): 1000 large-scale 3d environments for embodied AI. ArXiv, abs/2109.08238, 2021. URL https://api.semanticscholar.org/CorpusID:237563216.
- Qin et al. [2024] M. Qin, W. Li, J. Zhou, H. Wang, and H. Pfister. LangSplat: 3d language Gaussian splatting, 2024.
- Chen et al. [2022] M. Chen, Q. Hu, Z. Yu, H. Thomas, A. Feng, Y. Hou, K. McCullough, F. Ren, and L. Soibelman. STPLS3D: A large-scale synthetic and real aerial photogrammetry 3d point cloud dataset. arXiv preprint arXiv:2203.09065, 2022.
- Straub et al. [2019] J. Straub, T. Whelan, L. Ma, Y. Chen, E. Wijmans, S. Green, J. J. Engel, R. Mur-Artal, C. Ren, S. Verma, et al. The Replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797, 2019.
- Liu et al. [2021] H. Liu, A. Lin, X. Han, L. Yang, Y. Yu, and S. Cui. Refer-it-in-RGBD: A bottom-up approach for 3d visual grounding in RGBD images. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6028–6037, 2021.
- Mauceri et al. [2019] C. Mauceri, M. Palmer, and C. Heckman. SUN-Spot: An RGB-D dataset with spatial referring expressions. 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 1883–1886, 2019.
- Fang et al. [2020] H.-S. Fang, C. Wang, M. Gou, and C. Lu. GraspNet-1Billion: A large-scale benchmark for general object grasping. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11444–11453, 2020.
- Zhang et al. [2022] H. Zhang, D. Yang, H. Wang, B. Zhao, X. Lan, J. Ding, and N. Zheng. REGRAD: A large-scale relational grasp dataset for safe and object-specific robotic grasping in clutter. IEEE Robotics and Automation Letters, 7(2):2929–2936, 2022.
- Tziafas et al. [2023] G. Tziafas, X. Yucheng, A. Goel, M. Kasaei, Z. Li, and H. Kasaei. Language-guided robot grasping: CLIP-based referring grasp synthesis in clutter. In 7th Annual Conference on Robot Learning, 2023.
- Vuong et al. [2023] A. D. Vuong, M. N. Vu, H. Le, B. Huang, B. P. K. Huynh, T. D. Vo, A. Kugi, and A. Nguyen. Grasp-anything: Large-scale grasp dataset from foundation models. ArXiv, abs/2309.09818, 2023. URL https://api.semanticscholar.org/CorpusID:262045996.
- Armeni et al. [2016] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese. 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1534–1543, 2016.
- Rozenberszki et al. [2022] D. Rozenberszki, O. Litany, and A. Dai. Language-grounded indoor 3d semantic segmentation in the wild. In European Conference on Computer Vision, pages 125–141. Springer, 2022.
- Eppner et al. [2020] C. Eppner, A. Mousavian, and D. Fox. ACRONYM: A large-scale grasp dataset based on simulation. In 2021 IEEE Int. Conf. on Robotics and Automation, ICRA, 2020.
- Blender Online Community [2018] Blender Online Community. Blender - a 3d modelling and rendering package. 2018. URL http://www.blender.org.
- Chang et al. [2015] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University — Princeton University — Toyota Technological Institute at Chicago, 2015.
- GPT [2023] GPT-4V(ision) system card. 2023. URL https://api.semanticscholar.org/CorpusID:263218031.
- Kirillov et al. [2023] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. B. Girshick. Segment anything. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3992–4003, 2023. URL https://api.semanticscholar.org/CorpusID:257952310.
- Gu et al. [2021] X. Gu, T.-Y. Lin, W. Kuo, and Y. Cui. Open-vocabulary object detection via vision and language knowledge distillation. In International Conference on Learning Representations, 2021. URL https://api.semanticscholar.org/CorpusID:238744187.
- Schult et al. [2023] J. Schult, F. Engelmann, A. Hermans, O. Litany, S. Tang, and B. Leibe. Mask3D: Mask Transformer for 3D Semantic Instance Segmentation. 2023.
- Chen et al. [2023] S. Chen, W. N. Tang, P. Xie, W. Yang, and G. Wang. Efficient heatmap-guided 6-dof grasp detection in cluttered scenes. IEEE Robotics and Automation Letters, 8:4895–4902, 2023. URL https://api.semanticscholar.org/CorpusID:259363869.
- Koenig and Howard [2004] N. P. Koenig and A. Howard. Design and use paradigms for Gazebo, an open-source multi-robot simulator. 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 3:2149–2154, 2004.
- Choy et al. [2019] C. B. Choy, J. Gwak, and S. Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3070–3079, 2019. URL https://api.semanticscholar.org/CorpusID:121123422.
- Han et al. [2020] L. Han, T. Zheng, L. Xu, and L. Fang. OccuSeg: Occupancy-aware 3d instance segmentation. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2937–2946, 2020. URL https://api.semanticscholar.org/CorpusID:212725768.
- Hu et al. [2021a] W. Hu, H. Zhao, L. Jiang, J. Jia, and T.-T. Wong. Bidirectional projection network for cross dimension scene understanding. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14368–14377, 2021a. URL https://api.semanticscholar.org/CorpusID:232379958.
- Hu et al. [2021b] Z. Hu, X. Bai, J. Shang, R. Zhang, J. Dong, X. Wang, G. Sun, H. Fu, and C.-L. Tai. VMNet: Voxel-mesh network for geodesic-aware 3d semantic segmentation. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 15468–15478, 2021b. URL https://api.semanticscholar.org/CorpusID:236493200.
- Li et al. [2022] J. Li, X. He, Y. Wen, Y. Gao, X. Cheng, and D. Zhang. Panoptic-PHNet: Towards real-time and high-precision LiDAR panoptic segmentation via clustering pseudo heatmap. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11799–11808, 2022. URL https://api.semanticscholar.org/CorpusID:248811224.
- Robert et al. [2022] D. Robert, B. Vallet, and L. Landrieu. Learning multi-view aggregation in the wild for large-scale 3d semantic segmentation. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5565–5574, 2022. URL https://api.semanticscholar.org/CorpusID:248218804.
- Wu et al. [2014] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3D ShapeNets: A deep representation for volumetric shapes. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1912–1920, 2014. URL https://api.semanticscholar.org/CorpusID:206592833.
- Zhang et al. [2021] R. Zhang, Z. Guo, W. Zhang, K. Li, X. Miao, B. Cui, Y. J. Qiao, P. Gao, and H. Li. PointCLIP: Point cloud understanding by CLIP. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8542–8552, 2021. URL https://api.semanticscholar.org/CorpusID:244909021.
- Caesar et al. [2019] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom. nuScenes: A multimodal dataset for autonomous driving. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11618–11628, 2019. URL https://api.semanticscholar.org/CorpusID:85517967.
- Behley et al. [2019] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall. SemanticKITTI: A dataset for semantic scene understanding of LiDAR sequences. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9296–9306, 2019. URL https://api.semanticscholar.org/CorpusID:199441943.
- Achlioptas et al. [2020] P. Achlioptas, A. Abdelreheem, F. Xia, M. Elhoseiny, and L. J. Guibas. ReferIt3D: Neural listeners for fine-grained 3d object identification in real-world scenes. In European Conference on Computer Vision, 2020.
- Zhao et al. [2021] L. Zhao, D. Cai, L. Sheng, and D. Xu. 3DVG-Transformer: Relation modeling for visual grounding on point clouds. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 2908–2917, 2021.
- Huang et al. [2022] S. Huang, Y. Chen, J. Jia, and L. Wang. Multi-view transformer for 3d visual grounding. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15503–15512, 2022.
- Rozenberszki et al. [2022] D. Rozenberszki, O. Litany, and A. Dai. Language-grounded indoor 3d semantic segmentation in the wild. ArXiv, abs/2204.07761, 2022. URL https://api.semanticscholar.org/CorpusID:248227627.
- Zhong et al. [2021] Y. Zhong, J. Yang, P. Zhang, C. Li, N. C. F. Codella, L. H. Li, L. Zhou, X. Dai, L. Yuan, Y. Li, and J. Gao. RegionCLIP: Region-based language-image pretraining. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16772–16782, 2021. URL https://api.semanticscholar.org/CorpusID:245218534.
- Minderer et al. [2022] M. Minderer, A. A. Gritsenko, A. Stone, M. Neumann, D. Weissenborn, A. Dosovitskiy, A. Mahendran, A. Arnab, M. Dehghani, Z. Shen, X. Wang, X. Zhai, T. Kipf, and N. Houlsby. Simple open-vocabulary object detection with vision transformers. ArXiv, abs/2205.06230, 2022. URL https://api.semanticscholar.org/CorpusID:248721818.
- Zhou et al. [2022] X. Zhou, R. Girdhar, A. Joulin, P. Krahenbuhl, and I. Misra. Detecting twenty-thousand classes using image-level supervision. ArXiv, abs/2201.02605, 2022. URL https://api.semanticscholar.org/CorpusID:245827815.
- Minderer et al. [2023] M. Minderer, A. A. Gritsenko, and N. Houlsby. Scaling open-vocabulary object detection. ArXiv, abs/2306.09683, 2023. URL https://api.semanticscholar.org/CorpusID:259187664.
- Wang et al. [2021] Z. Wang, Y. Lu, Q. Li, X. Tao, Y. Guo, M. Gong, and T. Liu. CRIS: CLIP-driven referring image segmentation. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11676–11685, 2021. URL https://api.semanticscholar.org/CorpusID:244729320.
- Lüddecke and Ecker [2021] T. Lüddecke and A. S. Ecker. Image segmentation using text and image prompts. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7076–7086, 2021. URL https://api.semanticscholar.org/CorpusID:247794227.
- Shafiullah et al. [2022] N. M. M. Shafiullah, C. Paxton, L. Pinto, S. Chintala, and A. Szlam. CLIP-Fields: Weakly supervised semantic fields for robotic memory. ArXiv, abs/2210.05663, 2022. URL https://api.semanticscholar.org/CorpusID:252815898.
- Bolte et al. [2023] B. Bolte, A. S. Wang, J. Yang, M. Mukadam, M. Kalakrishnan, and C. Paxton. USA-Net: Unified semantic and affordance representations for robot memory. 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1–8, 2023. URL https://api.semanticscholar.org/CorpusID:258298248.