3D Feature Distillation with Object-Centric Priors

Georgios Tziafas
Department of Artificial Intelligence
University of Groningen, the Netherlands
g.t.tziafas@rug.nl
Yucheng Xu
School of Informatics
University of Edinburgh, United Kingdom
Yucheng.Xu@ed.ac.uk
Zhibin Li
Department of Computer Science
University College London, United Kingdom
alex.li@ucl.ac.uk
Hamidreza Kasaei
Department of Artificial Intelligence
University of Groningen, the Netherlands
h.kasaei@rug.nl
Abstract

Grounding natural language to the physical world is a ubiquitous topic with a wide range of applications in computer vision and robotics. Recently, 2D vision-language models such as CLIP have become widely popular, owing to their impressive capabilities for open-vocabulary grounding in 2D images. Recent works aim to elevate 2D CLIP features to 3D via feature distillation, but either learn neural fields that are scene-specific and hence lack generalization, or focus on indoor room scan data that require access to multiple camera views, which is not practical in robot manipulation scenarios. Additionally, related methods typically fuse features at pixel level and assume that all camera views are equally informative. In this work, we show that this approach leads to sub-optimal 3D features, in terms of both grounding accuracy and segmentation crispness. To alleviate this, we propose a multi-view feature fusion strategy that employs object-centric priors to eliminate uninformative views based on semantic information and to fuse features at object level via instance segmentation masks. To distill our object-centric 3D features, we generate a large-scale synthetic multi-view dataset of cluttered tabletop scenes, spawning 15k scenes from over 3300 unique object instances, which we make publicly available. We show that our method reconstructs 3D CLIP features with improved grounding capacity and spatial consistency, while doing so from single-view RGB-D, thus departing from the assumption of multiple camera views at test time. Finally, we show that our approach can generalize to novel tabletop domains and be re-purposed for 3D instance segmentation without fine-tuning, and demonstrate its utility for language-guided robotic grasping in clutter.

Figure 1: Visualization of 3D features (middle), back-projected 2D features (left) and user query similarity heatmaps (right), for previous SOTA point-cloud feature distillation method OpenScene and our DROP-CLIP. OpenScene fuses pixel-wise 2D features across all views with average pooling, leading to grounding failures, segmentation imprecisions and fuzzy object boundaries. Our method resolves these issues by employing object-centric priors to fuse object-level 2D features in 3D instance masks with semantics-informed view selection.

Keywords: Open-Vocabulary 3D Segmentation, Multi-view Feature Distillation

1 Introduction

Language grounding in 3D environments plays a crucial role in realizing intelligent systems that can interact naturally with the physical world. In robotics, being able to precisely segment desired objects in 3D based on open language queries (object semantics, visual attributes, affordances, etc.) can serve as a powerful proxy for enabling open-ended robot manipulation. As a result, research on 3D segmentation methods has grown in recent years [1, 2, 3, 4, 5, 6]. However, these methods fall in the closed-vocabulary regime, where only a fixed list of classes can be used as queries. Inspired by the success of open-vocabulary 2D methods [7, 8, 9, 10], recent efforts elevate 2D representations from pretrained models [7, 11] to 3D via distillation pipelines [12, 13, 14, 15, 16, 17, 18, 19]. However, we identify certain limitations of existing distillation approaches. On the one hand, field-based methods [13, 20, 16, 17, 18] offer continuous 3D feature fields, but they must be trained online in each specific scene and hence cannot generalize to novel object instances and compositions, they require a few minutes to train, and they need multiple camera views to be collected before training, all of which hinder their real-time applicability. On the other hand, the original 3D feature distillation methods and follow-up work [12, 14, 21] use room scan datasets [22, 23] to learn point-cloud encoders, and are hence applicable to novel scenes with open vocabularies. However, such approaches assume that 2D features from all views are equally informative, which is not the case in highly cluttered indoor scenes (e.g. due to partial occlusions in some views), thus leading to noisy 3D features. 2D features are also usually fused point-wise from ViT patches [9, 10, 8] or multi-scale crops [13, 6], leading to the so-called “patchyness” issue [24] (see Fig. 1). The latter issue is especially impactful in robot manipulation, where precise 3D segmentation is vital for specifying robust actuation goals.

To address such limitations, we revisit 2D $\rightarrow$ 3D point-based feature distillation but revise the multi-view feature fusion strategy to enhance the quality of the target 3D features. In particular, we inject both semantic and spatial object-centric priors into the fusion strategy, in three ways: (i) we obtain object-level 2D features by isolating object instances in each camera view using their 2D segmentation masks, (ii) we fuse features only at the corresponding 3D object regions using 3D segmentation masks, and (iii) we leverage object-level semantic information to devise an informativeness metric, which is used to weight the contribution of views and eliminate uninformative ones. Extensive ablation studies demonstrate the advantages of object-centric fusion compared to vanilla approaches. To train our method, we require a large-scale cluttered indoor dataset with many views per scene, which does not currently exist. To that end, we build MV-TOD (Multi-View Tabletop Objects Dataset), consisting of $\sim 15k$ Blender scenes from more than $3.3k$ unique 3D object models, for which we provide $73$ views per scene with $360^{\circ}$ coverage, further equipped with 2D/3D segmentations, 6-DoF grasps and textual object-level annotations. We use MV-TOD to distill our object-centric 3D CLIP [7] features into a 3D representation, which we call DROP-CLIP (Distilled Representations with Object-centric Priors from CLIP). Our 3D encoder operates on partial point-clouds from a single RGB-D view, thus departing from the requirement of multiple camera images at test time, while offering real-time inference capabilities. We demonstrate that our learned 3D features achieve high grounding performance and segmentation crispness, while significantly outperforming previous 2D open-vocabulary approaches in the single-view setting. Further, we show that they can be leveraged zero-shot in novel tabletop domains, as well as be used out-of-the-box for 3D instance segmentation.

In summary, our contributions are fourfold: (i) we release MV-TOD, a large-scale synthetic dataset of household objects in cluttered tabletop scenarios, featuring dense multi-view coverage and semantic/mask/grasp annotations, (ii) we identify limitations of current multi-view feature fusion approaches and illustrate how to overcome them by leveraging object-centric priors, (iii) we release DROP-CLIP, a 3D model that reconstructs view-independent 3D CLIP features from a single view, and (iv) we conduct extensive ablation studies, comparative experiments and robot demonstrations to showcase the effectiveness of the proposed method in terms of 3D segmentation performance, generalization to novel domains and tasks, and applicability in robot manipulation scenarios.

2 Multi-View Tabletop Objects Dataset

Figure 2: (Left:) Example generated Blender scene, multi-view image coverage and annotations included in our dataset. (Right:) Automatic semantic annotation generation with large vision-language models.
Dataset  Layout  Multi-View  Clutter  Vision Data  Ref.Expr. Annot.  Grasp Annot.  Num.Obj. Categories  Num. Scenes  Num. Expr.  Obj.-lvl Semantics
ScanNet [22]  indoor  -  RGB-D,3D  17  800  -
S3DIS [25]  indoor  -  RGB-D,3D  13  6  -
Replica [26]  indoor  -  RGB-D,3D  88  -  -
STPLS3D [25]  outdoor  -  3D  12  18  -
ScanRefer [1]  indoor  RGB-D,3D  2D/3D mask  18  800  51.5k
ReferIt-3D [2]  indoor  RGB-D,3D  2D/3D mask  18  707  125.5k
ReferIt-RGBD [27]  indoor  RGB-D  2D box  -  7.6k  38.4k
SunSpot [28]  indoor  RGB-D  2D box  38  1.9k  7.0k
GraspNet [29]  tabletop  3D  6-DoF  88  190  -
REGRAD [30]  tabletop  RGB-D,3D  6-DoF  55  47k  -
OCID-VLG [31]  tabletop  RGB-D,3D  2D mask  4-DoF  31  1.7k  89.6k  template
Grasp-Anything [32]  tabletop  RGB  2D mask  4-DoF  236  1M  -  open
MV-TOD (ours)  tabletop  RGB-D,3D  3D mask  6-DoF  149  15k  671.2k  open
Table 1: Comparisons between MV-TOD and existing 3D datasets.

Existing 3D datasets mainly focus on indoor scenes in room layouts [33, 22, 26], and related language annotations typically cover closed-set object categories (e.g. furniture) and spatial relations [1, 2, 27, 34, 28], which are not practical for robot manipulation tasks, where cluttered tabletop scenarios and open-vocabulary language are of key importance. On the other hand, recent grasp-related efforts collect cluttered tabletop scenes, but either lack language annotations [30, 35, 29] or connect cluttered scenes with language but only for 4-DoF grasps with RGB data [31, 32]. Further, most such datasets lack dense multi-view scene coverage, rendering them inapplicable for 2D $\rightarrow$ 3D feature distillation, where we require multiple images of each scene to extract 2D features with a foundation model. To cover this gap, we propose MV-TOD, a large-scale synthetic dataset with cluttered tabletop scenes featuring dense multi-view coverage and rich language annotations at the object level. We generate a total of $15k$ scenes in Blender [36], comprising 3379 unique object models, $99$ collected by us and the rest filtered from the ShapeNet-Sem model set [37]. The dataset features $149$ object categories, each of which includes multiple instances that vary in fine-grained details. For each object instance, we leverage GPT-4-Vision [38] to generate open-set descriptions from various perspectives, including category, color, material, state, utility, affordance, etc., which spawn over $670k$ unique referring instance queries (see Fig. 2-right and Appendix A). For each scene, we provide 2D/3D segmentation masks, 6D object poses, as well as a set of semantic concepts for each appearing object instance. Additionally, we include 6-DoF grasp annotations for each object model, originating from the ACRONYM dataset [35]. To the best of our knowledge, MV-TOD is the first dataset to combine 3D cluttered tabletop scenes with open-vocabulary language and 6-DoF grasp annotations, which we hope will accelerate future research.

3 Methodology

Figure 3: Method Overview: Given a 3D scene and multiple camera views, we employ three object-centric priors (in red) for multi-view feature fusion: (i) extract CLIP features from 2D masked object crops, (ii) use semantic annotations to fuse 2D features across views, (iii) apply the fused feature on all points in the object’s 3D mask. The fused feature-cloud is distilled with a single-view posed RGB-D encoder and cosine distance loss. During inference, we compute point-wise cosine similarity scores in CLIP space (higher similarity towards red).

Our goal is to distill multi-view 2D CLIP features into a 3D representation, while employing an object-centric feature fusion strategy to ensure high-quality 3D features. Our overall pipeline is illustrated in Fig. 3. We first introduce traditional multi-view feature fusion (Sec. 3.1), then present our variant with object-centric priors (Sec. 3.2), and finally discuss our feature distillation method (Sec. 3.3).

3.1 Multi-view 2D Feature Fusion

We assume access to a dataset of 3D scenes, where each scene is represented through a set of $\mathcal{V}$ posed RGB-D views of size $H \times W$: $\left\{I_v \in \mathbb{R}^{H \times W \times 3},\, D_v \in \mathbb{R}^{H \times W},\, T_v \in \mathbb{R}^{4 \times 4}\right\}_{v=1}^{\mathcal{V}}$, where $T_v$ is the camera pose of view $v$. For each scene, we first obtain the full point-cloud $P \in \mathbb{R}^{M \times 3}$ along with a 2D-3D correspondence map $\mathcal{M}_v \in \mathbb{R}^{M \times 2}$, mapping each point $\mathbf{x}_i,\, i=1,\dots,M$ to a pixel location $\mathbf{u}_{v,i} = (u_x, u_y)^T$ in image view $I_v$. RGB images are fed to a pretrained image model $f^{2D}: \mathbb{R}^{H \times W \times 3} \rightarrow \mathbb{R}^{H \times W \times C}$ [9, 10, 8] to obtain pixel-level 2D features of size $C$: $Z^{2D}_v = f^{2D}(I_v)$, which are then back-projected to 3D points via:

$$\mathbf{z}^{2D}_{v,i} = f^{2D}\left(I_v(\mathbf{u}_{v,i})\right) = f^{2D}\left(I_v(\mathcal{M}_v(\mathbf{x}_i))\right) \tag{1}$$
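The back-projection in equation (1) amounts to a gather over the pixel grid. The following is a minimal NumPy sketch of that step for one view, not the authors' implementation. It assumes integer pixel coordinates and an explicit boolean `visible` array (a hypothetical input that marks points projecting inside the image and not occluded); the function name `backproject_features` is likewise illustrative.

```python
import numpy as np

def backproject_features(feat_2d, corr, visible):
    """Gather one view's pixel features onto the 3D points.

    feat_2d: (H, W, C) pixel-level features of the view
    corr:    (M, 2) integer pixel coordinates (u_x, u_y) of each 3D point
    visible: (M,) boolean, True where the point is seen in this view
             (hypothetical input, assumed to be precomputed)
    Returns an (M, C) array; rows of invisible points are left at zero.
    """
    num_points, channels = corr.shape[0], feat_2d.shape[-1]
    out = np.zeros((num_points, channels), dtype=feat_2d.dtype)
    ux, uy = corr[visible, 0], corr[visible, 1]
    out[visible] = feat_2d[uy, ux]  # row index is u_y, column index is u_x
    return out

# Toy example: a 2x3 image with 4-dimensional features and three points.
feat = np.arange(2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4)
corr = np.array([[0, 0], [2, 1], [1, 0]])
visible = np.array([True, True, False])
z_view = backproject_features(feat, corr, visible)
```

Stacking the result over all views gives a per-view, per-point feature tensor that a fusion step can then reduce over the view axis.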

To fuse 2D features across views into $Z^{3D} \in \mathbb{R}^{M \times C}$, previous works [12, 6, 14, 21] use average pooling: $\mathbf{z}_i^{3D} = \frac{1}{\mathcal{V}} \sum_{v=1}^{\mathcal{V}} \mathbf{z}_{v,i}^{2D}$ (see Appendix B for a comprehensive overview). In essence, this method assumes that all views are equally informative for each point, as long as the point is visible from that view. We suggest that naively average pooling 2D features for each point leads to sub-optimal 3D features, as noisy, uninformative views contribute equally, therefore “polluting” the overall representation. We instead propose a generalized version relying on a weighted average:

$$\mathbf{z}^{3D}_i = \frac{\sum_{v=1}^{\mathcal{V}} \mathbf{z}^{2D}_{v,i} \cdot \omega_{v,i}}{\sum_{v=1}^{\mathcal{V}} \omega_{v,i}} \tag{2}$$

where $\omega_{v,i} \in \mathbb{R}$ is a scalar weight that represents the informativeness of view $v$ for point $i$. In the next subsection, we describe how to use text data to dynamically compute an informativeness weight for each view based on object-level semantic information. Additionally, vanilla point-wise fusion with pixel-level 2D features leads to non-crisp segmentations and fuzzy object boundaries. To resolve this, we propose to also leverage dense spatial information, i.e., instance-wise 2D/3D segmentation masks, which are used for both: (a) obtaining robust object-level 2D CLIP features from each view, and (b) fusing features only at the points corresponding to the 3D object region.
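To make the relation between average pooling and the weighted average of equation (2) concrete, here is a small NumPy sketch. It assumes the per-view features are stacked in a single array and that weights are non-negative; the `eps` guard, which leaves points with zero total weight at a zero feature, is an assumption of this sketch and not something specified in the text.

```python
import numpy as np

def fuse_views(z_2d, weights, eps=1e-8):
    """Weighted average over views, as in equation (2).

    z_2d:    (V, M, C) back-projected features per view and point
    weights: (V, M) non-negative weights (0 removes a view for a point)
    eps:     guard against division by zero (assumption of this sketch)
    """
    num = (z_2d * weights[..., None]).sum(axis=0)   # (M, C)
    den = weights.sum(axis=0)[:, None]              # (M, 1)
    return num / np.maximum(den, eps)

rng = np.random.default_rng(0)
z_2d = rng.normal(size=(3, 5, 4))

# Uniform weights reduce to plain average pooling.
avg = fuse_views(z_2d, np.ones((3, 5)))

# Zeroing the weights of view 0 removes it from the fused feature.
w = np.ones((3, 5))
w[0] = 0.0
without_first = fuse_views(z_2d, w)
```

With uniform weights the function reproduces average pooling, so any improvement comes entirely from how the weights are chosen.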

3.2 Employing Object-Centric Priors

Let $\left\{S_v^{2D} \in \{0,1\}^{N \times H \times W}\right\}_{v=1}^{\mathcal{V}}$ be 2D instance-wise segmentation masks for each scene, where $N$ is the total number of scene objects. We aggregate the 2D masks to obtain $S^{3D} \in \{0,1\}^{M \times N}$, such that for each point $i$ we can retrieve the corresponding object instance $n_i = \texttt{argmax}_{\nu}\; S^{3D}_{i,\nu}$.

Semantic informativeness metric Let $\mathcal{Q} = \left\{Q_k\right\}_{k=1}^{\mathcal{K}},\; Q_k \in \mathbb{R}^{N_k \times C}$ be a set of object-specific textual prompts, where $\mathcal{K}$ is the number of dataset object instances and $N_k$ the number of prompts for object $k$. We use CLIP’s text encoder to embed the textual prompts in $\mathbb{R}^{C}$ and average them to obtain an object-specific prompt $\mathbf{q}_k = 1/N_k \cdot \sum_{j=1}^{N_k} Q_{k,j}$. For each scene, we map each object instance $n \in [1, N]$ to its positive prompt $\mathbf{q}_n^{+}$, as well as a set $Q^{-}_n \doteq \mathcal{Q} - \{\mathbf{q}_n^{+}\}$ of negative prompts corresponding to all other instances. We define our semantic informativeness metric as:

$$G_{v,i} = \texttt{cos}(\mathbf{z}^{2D}_{v,i}, \mathbf{q}^{+}_{n_i}) - \texttt{max}_{\mathbf{q} \sim Q^{-}_{n_i}} \texttt{cos}(\mathbf{z}^{2D}_{v,i}, \mathbf{q}) \tag{3}$$

Intuitively, we want a 2D feature from view $v$ to contribute to the overall 3D feature of point $i$ according to how much its similarity with the correct object instance exceeds the maximum similarity to any of the negative object instances, hence offering a proxy for semantic informativeness. We clip negative values of this weight to 0, thereby eliminating views that do not satisfy the condition $G_{v,i} \geq 0$. Plugging our metric into equation (2) already provides improvements over vanilla average pooling (see Sec. 4.1); however, it does not deal with 3D spatial consistency, for which we employ our spatial priors below.
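The margin in equation (3), together with the clipping at zero, can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' code: it assumes the text prompts have already been embedded and averaged into one vector per object instance, that at least two instances exist, and the helper names `l2_normalize` and `informativeness` are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale rows to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def informativeness(z, prompts, pos_idx):
    """Positive-minus-hardest-negative margin, clipped at 0 (equation 3).

    z:       (M, C) 2D features from one view, one row per point or object
    prompts: (K, C) averaged text embeddings, one per object instance (K >= 2)
    pos_idx: (M,) index of the positive prompt for each row of z
    """
    sim = l2_normalize(z) @ l2_normalize(prompts).T   # (M, K) cosine similarities
    rows = np.arange(z.shape[0])
    pos = sim[rows, pos_idx]
    neg = sim.copy()
    neg[rows, pos_idx] = -np.inf                      # exclude the positive prompt
    return np.clip(pos - neg.max(axis=1), 0.0, None)

prompts = np.eye(3)
z = np.array([[2.0, 0.0, 0.0],    # matches its positive prompt exactly
              [1.0, 0.0, 0.0],    # closer to a negative prompt: eliminated
              [3.0, 1.0, 0.0]])   # partially informative
pos_idx = np.array([0, 1, 0])
G = informativeness(z, prompts, pos_idx)
```

In the toy example, the second row receives weight 0 because a negative prompt scores higher than the positive one, which is exactly the case the clipping is meant to remove.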

Object-level 2D CLIP features To obtain object-level 2D CLIP features, we isolate the pixels of each object $n$ in each view $v$ using $S^{2D}_{v,n}$ and crop a bounding box around the mask from $I_v$: $\mathbf{z}^{2D}_{v,n} = f^{2D}_{cls}\left(\texttt{cropmask}(I_v,\, S^{2D}_{v,n})\right)$ (see Appendix C for ablations on CLIP visual prompts). Here we use $f^{2D}_{cls}: \mathbb{R}^{h_n \times w_n \times 3} \rightarrow \mathbb{R}^{C}$, i.e., only the [CLS] feature of CLIP’s ViT encoder, to represent an object crop of size $h_n \times w_n$. We can now define our metric from equation (3) also at object-level:

$$G_{v,n} = \texttt{cos}(\mathbf{z}^{2D}_{v,n}, \mathbf{q}^{+}_{n}) - \texttt{max}_{\mathbf{q} \sim Q^{-}_{n}} \texttt{cos}(\mathbf{z}^{2D}_{v,n}, \mathbf{q}) \tag{4}$$

where $G \in \mathbb{R}^{\mathcal{V} \times N}$, and its entry $G_{v,n}$ now represents the semantic informativeness of view $v$ for object instance $n$.
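The masked crop that is passed to the [CLS] encoder can be illustrated as follows. This is only one plausible variant, written under the assumption that background pixels are replaced by a constant `fill` value before taking the tight bounding box; the exact visual prompt used by the authors is discussed in their Appendix C and may differ. The CLIP encoder itself is omitted.

```python
import numpy as np

def cropmask(image, mask, fill=0):
    """Blank out the background and crop the tight bounding box of the mask.

    image: (H, W, 3) RGB image of one view
    mask:  (H, W) boolean mask of one object in that view
    fill:  background value (assumption of this sketch)
    Returns an (h_n, w_n, 3) crop, or None if the object is not visible.
    """
    if not mask.any():
        return None
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    masked = np.where(mask[..., None], image, fill)
    return masked[y0:y1, x0:x1]

image = np.full((4, 5, 3), 7, dtype=np.uint8)
mask = np.zeros((4, 5), dtype=bool)
mask[1:3, 2:4] = True
mask[1, 2] = False                      # one background pixel inside the box
crop = cropmask(image, mask)
empty = cropmask(image, np.zeros((4, 5), dtype=bool))
```

Returning `None` for objects that are absent from a view lets the caller skip that view for the object, which is consistent with a visibility weight of zero.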

Fusing object-wise features A 3D object-level feature can be obtained by fusing 2D object-level features across views similar to equation (2):

$$\mathbf{z}^{3D}_n = \frac{\sum_{v=1}^{\mathcal{V}} \mathbf{z}^{2D}_{v,n} \cdot \omega_{v,n}}{\sum_{v=1}^{\mathcal{V}} \omega_{v,n}} = \frac{\sum_{v=1}^{\mathcal{V}} \mathbf{z}^{2D}_{v,n} \cdot \Lambda_{v,n} \cdot G_{v,n}}{\sum_{v=1}^{\mathcal{V}} \Lambda_{v,n} \cdot G_{v,n}} \tag{5}$$

where each view is weighted by its semantic informativeness metric $G_{v,n}$, as well as optionally a visibility metric $\Lambda_{v,n} = \sum S^{2D}_{v,n}$ that measures the number of pixels of the $n$-th object’s mask that are visible from view $v$ [6]. We finally reconstruct the full feature-cloud $Z^{3D} \in \mathbb{R}^{M \times C}$ by equating each point’s feature to its corresponding 3D object-level one via: $\mathbf{z}^{3D}_i = \mathbf{z}^{3D}_{n_i},\; n_i = \texttt{argmax}_{\nu}\; S^{3D}_{i,\nu}$.
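Equation (5) and the final broadcast to points can be sketched together in NumPy. The sketch assumes that the object-level features, the clipped informativeness scores and the per-point object indices are already available as arrays; the function name `fuse_objects` and the `eps` guard are illustrative choices, not part of the described method.

```python
import numpy as np

def fuse_objects(z_obj, G, masks_2d, point_obj, eps=1e-8):
    """Object-wise fusion (equation 5) followed by broadcasting to points.

    z_obj:     (V, N, C) object-level 2D features per view
    G:         (V, N) semantic informativeness, already clipped at 0
    masks_2d:  (V, N, H, W) binary instance masks per view
    point_obj: (M,) object index n_i of every 3D point
    Returns the (M, C) fused feature-cloud.
    """
    V, N = masks_2d.shape[:2]
    lam = masks_2d.reshape(V, N, -1).sum(axis=-1)    # (V, N) visible pixel counts
    w = lam * G                                      # combined view weights
    z_n = (z_obj * w[..., None]).sum(axis=0)         # (N, C) weighted sums
    z_n = z_n / np.maximum(w.sum(axis=0), eps)[:, None]
    return z_n[point_obj]                            # every point copies its object

# Two views, two objects, 3-dimensional features.
z_obj = np.stack([np.ones((2, 3)), 3.0 * np.ones((2, 3))])
G = np.array([[1.0, 1.0],
              [1.0, 0.0]])            # view 1 is uninformative for object 1
masks = np.zeros((2, 2, 2, 3))
masks[0, 0, 0, :1] = 1                # object 0: 1 pixel in view 0
masks[1, 0, 0, :3] = 1                # object 0: 3 pixels in view 1
masks[0, 1, 1, :2] = 1                # object 1: 2 pixels in view 0
masks[1, 1] = 1                       # object 1: 6 pixels in view 1 (ignored, G = 0)
point_obj = np.array([0, 1, 1, 0])
Z3d = fuse_objects(z_obj, G, masks, point_obj)
```

Because every point of an object receives the same vector, the resulting feature-cloud is piecewise constant over the 3D instance masks, which is what produces crisp object boundaries.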

3.3 View-Independent Feature Distillation

Even though the above feature-cloud $Z^{3D}$ could be directly used for open-vocabulary grounding in 3D, its construction is computationally intensive and requires a lot of expensive resources, such as access to multiple camera views, view-aligned 2D instance segmentation masks, as well as a set of text descriptions to compute informativeness metrics. Such utilities are rarely available in open-ended scenarios, especially in robotic applications, where usually only single-view RGB-D images from sensors mounted on the robot are provided. To tackle this, we wish to distill all the above knowledge from the feature-cloud $Z^{3D}$ into a 3D encoder that receives only a partial point-cloud from single-view posed RGB-D. Hence, the only assumption that we make during inference is access to camera intrinsic and extrinsic parameters, which is a mild requirement in most robotic works.

In particular, given a partial colored point-cloud from view $v$: $P_{v}\in\mathbb{R}^{M_{v}\times 6}$ (3D coordinates plus colors), we train a 3D encoder $\mathcal{E}_{\theta}:\mathbb{R}^{M_{v}\times 6}\rightarrow\mathbb{R}^{M_{v}\times C}$ such that $\mathcal{E}_{\theta}(P_{v})=Z^{3D}$. Notice that the distillation target $Z^{3D}$ is independent of view $v$. Following [12, 15] we use cosine distance loss:

$\mathcal{L}(\theta)=1-\texttt{cos}(\mathcal{E}_{\theta}(P_{v}),\,Z^{3D})$ (6)
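A minimal NumPy sketch of this objective is given below. It assumes that the predicted and target features are row-aligned (i.e. the target rows correspond to the points visible in view $v$) and that the per-point distances are averaged; both choices, as well as the function name, are assumptions of this example rather than details stated above.

```python
import numpy as np

def cosine_distance_loss(pred: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Mean over points of 1 - cos(pred_i, target_i).

    pred, target: (M_v, C) row-aligned feature matrices (alignment is assumed).
    """
    pred_n = pred / (np.linalg.norm(pred, axis=1, keepdims=True) + eps)
    target_n = target / (np.linalg.norm(target, axis=1, keepdims=True) + eps)
    cos = (pred_n * target_n).sum(axis=1)  # per-point cosine similarity
    return float((1.0 - cos).mean())       # mean reduction is an assumption
```

The loss is zero when every predicted feature points in the same direction as its target, regardless of magnitude, which matches the fact that the features are later compared with text embeddings through cosine similarity.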

See Appendix B.2 for training implementation details. With such a setup, we can obtain 3D features that: (i) are co-embedded in CLIP text space, so they can be leveraged for 3D segmentation tasks from open-vocabulary queries via computing cosine similarities between CLIP text embeddings $Q$ and the predicted feature cloud: $\hat{S}_{i}=\texttt{argmax}_{i}\;\texttt{cos}(\hat{\mathbf{z}}_{i}^{3D},Q)$, (ii) are ensured to be optimally informative per object, due to the usage of the semantic informativeness metric to compute $Z^{3D}$, (iii) maintain 3D spatial consistency in object boundaries, due to performing object-wise instead of point-wise fusion when computing $Z^{3D}$, and (iv) are encouraged to be view-independent, as the same features $Z^{3D}$ are utilized as distillation targets regardless of the input view $v$. Importantly, no labels, prompts, or segmentation masks are needed at test-time to reproduce the fused feature-cloud, while obtaining it amounts to a single forward pass of our 3D encoder, hence offering real-time performance.
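Property (i) can be illustrated with the following sketch, which scores every point of a predicted feature-cloud against one or more text embeddings. Text embeddings are taken as given arrays (no CLIP call is made here), and the similarity threshold used for referring segmentation is a hypothetical parameter introduced only for this example.

```python
import numpy as np

def l2_normalize(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def query_similarity(feats: np.ndarray, text_embs: np.ndarray) -> np.ndarray:
    """Cosine similarity between (M, C) point features and (K, C) text embeddings."""
    return l2_normalize(feats) @ l2_normalize(text_embs).T  # (M, K)

def semantic_segmentation(feats: np.ndarray, class_embs: np.ndarray) -> np.ndarray:
    """Label each point with the index of its most similar class prompt."""
    return query_similarity(feats, class_embs).argmax(axis=1)

def referring_segmentation(feats: np.ndarray, query_emb: np.ndarray, threshold: float) -> np.ndarray:
    """Boolean mask of points whose similarity to one query exceeds a threshold.

    The thresholding rule is an assumption of this sketch.
    """
    sim = query_similarity(feats, query_emb[None, :])[:, 0]
    return sim > threshold
```

The per-point similarity returned by `query_similarity` is also the quantity one would color-code to obtain grounding heatmaps such as those in Fig. 4.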

Figure 4: Open-Vocabulary 3D Referring Segmentation. Examples of learned 3D features and grounding heatmaps from open-ended language queries (class names, attributes, user affordances, and open instance-specific concepts) in scenes from MV-TOD dataset. Points are colored based on their query similarity (higher towards red). We note that table points are excluded from similarity computation in our visualizations.

4 Experiments

In our experiments, we explore the following questions: (i) Sec. 4.1: What are the contributions of our proposed object-centric priors for multi-view feature fusion? Does the dense number of views of our proposed dataset also contribute? (ii) Sec. 4.2: How does our method compare to previous open-vocabulary approaches for 3D semantic and referring segmentation tasks? Are the learned features robust to open-ended language? (iii) Sec. 4.3: What are the generalization capabilities of our learned 3D representation in novel domains and novel tasks (3D instance segmentation)? (iv) Sec. 4.4: Can we leverage our 3D learned representation for language-guided 6-DoF robotic grasping?

4.1 Multi-view Feature Fusion Ablation Studies

Fusion $\mathbf{f^{2D}}$ $\mathbf{\Lambda_{v,i}}$ $\mathbf{G_{v,i}}$ Ref.Segm (%)
mIoU Pr@25 Pr@50 Pr@75
point patch 44.2 59.9 41.4 27.0
point patch 37.3 55.4 33.7 16.7
point patch 57.0 74.1 59.5 40.9
point patch 57.4 77.0 60.9 39.9
obj obj 65.6 67.0 65.4 64.1
obj obj 67.3 68.7 67.1 65.8
obj obj 83.1 83.9 83.1 82.4
obj obj 80.9 83.1 80.2 79.7
Table 2: Multi-view feature fusion ablation study for 3D referring segmentation in MV-TOD.

To evaluate the contributions of our proposed object-centric priors, we conduct ablation studies on the multi-view feature fusion pipeline, where we compare 3D referring segmentation results of obtained 3D features in held-out scenes of MV-TOD. We highlight that here we aim to establish a performance upper bound that the feature fusion method can provide for distillation, and not the distilled features themselves.

Figure 5: Referring segmentation accuracy (Pr@25 (%)) vs. number of utilized views.

We ablate: (i) patch-wise vs. object-wise fusion, (ii) MaskCLIP [8] patch-level vs. CLIP [7] masked crop features, (iii) inclusion of visibility ($\Lambda_{v,i}$) and semantic informativeness ($G_{v,i}$) metrics for view selection. Results are reported in Table 2.

Effect of object-centric priors. We observe that all components contribute positively to the quality of the 3D features. Our proposed $G_{v,i}$ metric boosts mIoU across both point- and object-wise fusion (57.0% vs. 44.2% and 83.1% vs. 65.6% respectively). Further, we observe that the usage of spatial priors for object-wise fusion and object-level features leads to both higher segmentation crispness (25.7% mIoU delta), as well as higher grounding precision (42.5% Pr@75 delta). See qualitative comparisons in Appendix D.
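For reference, the metrics reported in Table 2 can be computed as in the sketch below. It assumes the definition commonly used for referring segmentation, where Pr@X is the percentage of queries whose predicted mask has an IoU above X% with the ground truth; the strict inequality and the function names are assumptions of this example.

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two boolean point masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def referring_metrics(preds, gts, thresholds=(0.25, 0.50, 0.75)) -> dict:
    """mIoU and Pr@X (in %) over a list of (prediction, ground-truth) masks."""
    ious = np.array([iou(p, g) for p, g in zip(preds, gts)])
    out = {"mIoU": 100.0 * float(ious.mean())}
    for t in thresholds:
        # Pr@X: share of queries with IoU strictly above the threshold (assumed).
        out[f"Pr@{int(round(t * 100))}"] = 100.0 * float((ious > t).mean())
    return out
```

Under this definition, a method whose Pr@25 and Pr@75 are close produces masks that are either almost entirely right or entirely wrong, which is the pattern of the object-wise rows in Table 2.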

Effect of the number of views. We ablate the 3D referring segmentation performance based on the number of input views in Fig. 5, where novel viewpoints are added incrementally. We observe that in both setups (point- and object-wise) fusing features from more views leads to improvements, with a small plateauing behavior around 40 views. We believe this is an encouraging result for leveraging dense multi-view coverage in feature distillation pipelines, as we propose with the introduction of MV-TOD.

4.2 Open-Vocabulary 3D Segmentation Results

Method #views Ref.Segm. (%) Sem.Segm (%)
mIoU Pr@25 Pr@50 Pr@75 mIoU mAcc
OpenScene  [12] 73 29.32 44.00 24.51 11.26 21.79 32.14
OpenMask3D  [6] 73 65.38 73.05 63.99 57.40 59.47 66.48
DROP-CLIP (Ours) 73 82.67 86.11 82.43 79.23 75.41 80.02
DROP-CLIP (Ours) 73 66.56 75.73 67.55 59.88 62.04 70.74
OpenSeg→3D  [9] 1 12.89 17.36 2.38 0.23 12.83 17.21
MaskCLIP→3D  [8] 1 25.64 40.36 18.69 6.95 20.97 32.09
DROP-CLIP (Ours) 1 62.31 71.96 62.75 53.85 54.48 64.41
Table 3: Referring and Semantic segmentation results on MV-TOD test split. Methods with denote upper-bound 3D features, whereas DROP-CLIP denotes our distilled model. Methods with →3D produce 2D predictions that are projected to 3D to compute metrics. Method with * denotes further usage of ground-truth segmentation masks.

In this section, we compare referring and semantic segmentation performance of our distilled features vs. previous open-vocabulary approaches, both in multi-view and in single-view settings. For multi-view, we compare our trained model with OpenScene [12] and OpenMask3D [6] methods, where the full point-cloud from all 73 views is given as input.

Figure 6: Referring segmentation accuracy (Pr@25 (%)) vs. different language query types.

We note that for these baselines we use the upper-bound 3D features, obtained as before: since our trained model already outperforms them, we refrained from also distilling features from the baselines (details in Appendix C2). For single-view, we feed our network with a partial point-cloud from the projected RGB-D pair, and compare with 2D baselines MaskCLIP [8] and OpenSeg [9]. Our model slightly outperforms the OpenMask3D upper-bound baseline in the multi-view setting (+1.18% in referring and +2.57% in semantic segmentation), while significantly outperforming 2D baselines in the single-view setting (>30% in both tasks). Importantly, single-view results closely match the multi-view ones (∼ -4.0%), suggesting that DROP-CLIP indeed learns view-independent features.

Open-ended queries. We evaluate the robustness of our model to different types of input language queries, organized in 4 families (class name, e.g. "cereal box"; class + attribute, e.g. "brown cereal box"; open, e.g. "chocolate Kellogs"; and affordance, e.g. "I want something sweet"). Comparative results are presented in Fig. 6 and qualitative ones in Fig. 4. We observe that our method achieves high grounding accuracy in all query types, even when using a single view.

4.3 Zero-Shot Transfer to Novel Domains / Tasks

Method OCID-VLG  [31] REGRAD  [30]
IoU Pr@25 IoU Pr@25
MaskCLIP→3D  [8] 24.1 30.9 33.2 39.0
DROP-CLIP (Ours) 46.2 48.9 59.1 63.0
Table 4: Referring segmentation results in OCID-VLG [31] and REGRAD [30] datasets

Generalization to Novel Domains. We evaluate the 3D referring segmentation performance of our trained model when applied zero-shot in novel tabletop domains. We test in 500 scenes from OCID-VLG [31] using the dataset’s instance-wise open queries, as well as in 1000 scenes from REGRAD [30], using each model’s class name as a query. Only single-view input is provided for both datasets.

Method mIoU $AP_{25}$ $AP_{50}$
SAM  [39] 70.11 95.26 79.88
DROP-CLIP (S) 80.83 91.92 86.83
Mask3D [41] 14.41 18.65 3.41
DROP-CLIP (F) 88.37 93.13 91.47
Table 5: Zero-shot 3D instance segmentation results in MV-TOD.

We compare with MaskCLIP [8] as above and report results in Table 4. We note that test datasets contain both novel object instances (REGRAD) and classes (OCID-VLG). We observe that our method provides a significant performance boost across both domains (22.1% mIoU delta in OCID-VLG and 25.9% in REGRAD).

Zero-Shot 3D Instance Segmentation. Since our method has been distilled from features with object-level priors, we demonstrate that it can be used out-of-the-box for 3D instance segmentation, via clustering the 3D features (see Appendix E for implementation details). We report results in MV-TOD in Table 5, where we compare with SAM [39] with single-view images, as well as Mask3D [41] with full point-clouds (transferred from ScanRefer [1] with room layout). Mask3D struggles to generalize to tabletop domains, whereas our method achieves comparable performance with SAM for segmenting from single-view, even without being explicitly trained for instance segmentation.
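To give an intuition of why clustering suffices, the sketch below groups points with a simple greedy rule on cosine similarity. This is a stand-in chosen for brevity and is not the clustering procedure of Appendix E; the similarity threshold and the function name are hypothetical.

```python
import numpy as np

def cluster_features(feats: np.ndarray, sim_threshold: float = 0.9, eps: float = 1e-8) -> np.ndarray:
    """Greedy clustering of (M, C) point features by cosine similarity.

    Each unassigned point seeds a new cluster and absorbs all still-unassigned
    points whose similarity to the seed exceeds sim_threshold (assumed value).
    Returns an (M,) array of integer instance ids.
    """
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + eps)
    labels = np.full(len(f), -1, dtype=int)
    next_id = 0
    for i in range(len(f)):
        if labels[i] != -1:
            continue
        members = (f @ f[i] > sim_threshold) & (labels == -1)
        members[i] = True  # the seed always belongs to its own cluster
        labels[members] = next_id
        next_id += 1
    return labels
```

Because the distillation targets assign one shared feature to all points of an object, points of the same instance are expected to lie close together in feature space, so even a coarse grouping rule can separate instances with distinct semantics.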

4.4 Open-Vocabulary Language-guided Robotic Grasping

Figure 7: Language-guided 6-DoF grasping trial with real robot (left), 3D features, grounding and grasp proposals (right).


In this section, we wish to illustrate the applicability of DROP-CLIP in a language-guided robotic grasping scenario. We integrate our method with a 6-DoF grasp detection network [42], to segment and then propose gripper poses for picking a target object indicated verbally. We randomly place 5-12 objects on a tabletop with different levels of clutter, and query the robot to pick the target object and place it in a fixed position. The user instruction is open-vocabulary and can involve open object descriptions, attributes, or affordances. We conducted 50 trials in Gazebo [43] and 10 with a real robot, and observed grounding accuracy of 84% and 80% respectively, and a final success rate of 64% and 60%, where failures were mostly due to grasp proposals that were outside of the robot’s kinematic range or motion plans that led to a collision with other objects or the table. Our setup and example trials are shown in Fig. 7, while more details and qualitative results are provided in Appendix E. A video of robot demonstrations is provided as supplementary material.
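One simple way to connect a grounded segmentation with grasp proposals is sketched below: keep only the proposals whose grasp center lies close to the segmented target points and pick the highest-scoring one. This selection rule, the distance tolerance and all names are assumptions made for illustration, and do not describe the interface of the grasp detection network [42].

```python
import numpy as np

def select_grasp(grasp_centers, grasp_scores, target_points, max_dist=0.02):
    """Pick the best-scoring grasp near the grounded object.

    grasp_centers: (G, 3) grasp positions, grasp_scores: (G,) quality scores,
    target_points: (T, 3) points of the segmented target object,
    max_dist: hypothetical distance tolerance in meters.
    Returns the index of the selected grasp, or None if no grasp is close enough.
    """
    diffs = grasp_centers[:, None, :] - target_points[None, :, :]
    nearest = np.linalg.norm(diffs, axis=2).min(axis=1)  # distance to closest target point
    candidates = np.flatnonzero(nearest <= max_dist)
    if candidates.size == 0:
        return None
    return int(candidates[np.argmax(grasp_scores[candidates])])
```

Kinematic reachability and collision checks, which account for most of the failures reported above, would still need to be handled by the motion planner.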

5 Related work

3D Scene Understanding. There is a long line of work in closed-set 3D scene understanding [44, 45, 46, 47, 48, 49], applied in 3D classification [50, 51], localization [52, 1] and segmentation [53, 23, 22], using two-stage pipelines with instance proposals from point-clouds [54, 55] or RGB-D views [56, 27], or single-stage methods [3] that leverage 3D-language cross attentions. Rozenberszki et al. [57] use CLIP embeddings for pretraining a 3D segmentation model, but the resulting model still cannot be applied in an open-vocabulary setting.

Open-Vocabulary Grounding with CLIP. Following the impressive results of CLIP [7] for open-set image recognition, follow-up works transfer CLIP’s powerful representations from image- to pixel-level [40, 58, 59, 60, 61, 62, 63, 9, 10, 8], extending to detection / segmentation, but limited to 2D. For 3D segmentation, the closest work is perhaps OpenMask3D [6], which extracts multi-view CLIP features from instance proposals from Mask3D [41] to compute similarities with open text queries.

3D CLIP Feature Distillation. Recent works distill features from 2D foundation models with point-cloud encoders [12, 14, 21] or neural fields [13, 17, 18, 19, 24], with applications in robot manipulation [20, 16] and navigation [64, 65]. However, associated works extract 2D features from OpenSeg [9], LSeg [10], MaskCLIP [8] or multi-scale crops from CLIP [7] and fuse point-wise with average pooling, while our approach leverages semantics-informed view selection and segmentation masks to do object-wise fusion with object-level features (see detailed overview in Appendix F).

6 Conclusion, Limitations and Future Work

We propose DROP-CLIP, a 2D→3D CLIP feature distillation framework that employs object-centric priors to select views based on semantic informativeness and ensure crisp 3D segmentations, while working with single-view RGB-D. We also release MV-TOD, a large-scale synthetic dataset of multi-view tabletop scenes with dense annotations that can be leveraged for several downstream tasks. We hope our work can benefit the robotics community, both in terms of released resources as well as illustrating and overcoming theoretical limitations of existing 3D feature distillation works.

While our spatial object-centric priors lead to improved segmentation quality, they collapse local features in favor of a global object-level feature, and hence cannot be applied for segmenting object parts. In the future, we plan to add object part annotations in our dataset and fuse with both object- and part-level masks. Second, DROP-CLIP only provides grounding and a two-stage pipeline is needed for grasping, while our dataset already provides rich 6-DoF grasp annotations. A next step would be to also distill them, opting for a joint 3D representation for grounding and grasping.

References

  • Chen et al. [2020] D. Z. Chen, A. X. Chang, and M. Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16, pages 202–221. Springer, 2020.
  • Achlioptas et al. [2020] P. Achlioptas, A. Abdelreheem, F. Xia, M. Elhoseiny, and L. Guibas. Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes. 16th European Conference on Computer Vision (ECCV), 2020.
  • Luo et al. [2022] J. Luo, J. Fu, X. Kong, C. Gao, H. Ren, H. Shen, H. Xia, and S. Liu. 3d-sps: Single-stage 3d visual grounding via referred point progressive selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16454–16463, 2022.
  • Huang et al. [2021] P.-H. Huang, H.-H. Lee, H.-T. Chen, and T.-L. Liu. Text-guided graph neural networks for referring 3d instance segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 1610–1618, 2021.
  • Qian et al. [2024] Z. Qian, Y. Ma, J. Ji, and X. Sun. X-refseg3d: Enhancing referring 3d instance segmentation via structured cross-modal graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4551–4559, 2024.
  • Takmaz et al. [2023] A. Takmaz, E. Fedele, R. W. Sumner, M. Pollefeys, F. Tombari, and F. Engelmann. Openmask3d: Open-vocabulary 3d instance segmentation. ArXiv, abs/2306.13631, 2023. URL https://api.semanticscholar.org/CorpusID:259243888.
  • Radford et al. [2021] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. CoRR, abs/2103.00020, 2021. URL https://arxiv.org/abs/2103.00020.
  • Dong et al. [2022] X. Dong, Y. Zheng, J. Bao, T. Zhang, D. Chen, H. Yang, M. Zeng, W. Zhang, L. Yuan, D. Chen, F. Wen, and N. Yu. Maskclip: Masked self-distillation advances contrastive language-image pretraining. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10995–11005, 2022. URL https://api.semanticscholar.org/CorpusID:251799827.
  • Ghiasi et al. [2021] G. Ghiasi, X. Gu, Y. Cui, and T.-Y. Lin. Scaling open-vocabulary image segmentation with image-level labels. In European Conference on Computer Vision, 2021. URL https://api.semanticscholar.org/CorpusID:250895808.
  • Li et al. [2022] B. Li, K. Q. Weinberger, S. J. Belongie, V. Koltun, and R. Ranftl. Language-driven semantic segmentation. ArXiv, abs/2201.03546, 2022. URL https://api.semanticscholar.org/CorpusID:245836975.
  • Oquab et al. [2023] M. Oquab, T. Darcet, T. Moutakanni, H. Q. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y. B. Huang, S.-W. Li, I. Misra, M. G. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jégou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski. Dinov2: Learning robust visual features without supervision. ArXiv, abs/2304.07193, 2023. URL https://api.semanticscholar.org/CorpusID:258170077.
  • Peng et al. [2022] S. Peng, K. Genova, C. M. Jiang, A. Tagliasacchi, M. Pollefeys, and T. A. Funkhouser. Openscene: 3d scene understanding with open vocabularies. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 815–824, 2022. URL https://api.semanticscholar.org/CorpusID:254044069.
  • Kerr et al. [2023] J. Kerr, C. M. Kim, K. Goldberg, A. Kanazawa, and M. Tancik. Lerf: Language embedded radiance fields. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 19672–19682, 2023. URL https://api.semanticscholar.org/CorpusID:257557329.
  • Nguyen et al. [2023] P. D. Nguyen, T. Ngo, C. Gan, E. Kalogerakis, A. D. Tran, C. Pham, and K. Nguyen. Open3dis: Open-vocabulary 3d instance segmentation with 2d mask guidance. ArXiv, abs/2312.10671, 2023. URL https://api.semanticscholar.org/CorpusID:266348609.
  • Koch et al. [2024] S. Koch, N. Vaskevicius, M. Colosi, P. Hermosilla, and T. Ropinski. Open3dsg: Open-vocabulary 3d scene graphs from point clouds with queryable objects and open-set relationships. ArXiv, abs/2402.12259, 2024. URL https://api.semanticscholar.org/CorpusID:267750890.
  • Shen et al. [2023] B. W. Shen, G. Yang, A. Yu, J. R. Wong, L. P. Kaelbling, and P. Isola. Distilled feature fields enable few-shot language-guided manipulation. In Conference on Robot Learning, 2023. URL https://api.semanticscholar.org/CorpusID:260926035.
  • Tschernezki et al. [2022] V. Tschernezki, I. Laina, D. Larlus, and A. Vedaldi. Neural feature fusion fields: 3d distillation of self-supervised 2d image representations. 2022 International Conference on 3D Vision (3DV), pages 443–453, 2022. URL https://api.semanticscholar.org/CorpusID:252118532.
  • Kobayashi et al. [2022] S. Kobayashi, E. Matsumoto, and V. Sitzmann. Decomposing nerf for editing via feature field distillation. ArXiv, abs/2205.15585, 2022. URL https://api.semanticscholar.org/CorpusID:249209811.
  • Engelmann et al. [2024] F. Engelmann, F. Manhardt, M. Niemeyer, K. Tateno, M. Pollefeys, and F. Tombari. Opennerf: Open set 3d neural scene segmentation with pixel-wise features and rendered novel views, 2024.
  • Rashid et al. [2023] A. Rashid, S. Sharma, C. M. Kim, J. Kerr, L. Y. Chen, A. Kanazawa, and K. Goldberg. Language embedded radiance fields for zero-shot task-oriented grasping. In Conference on Robot Learning, 2023. URL https://api.semanticscholar.org/CorpusID:261882332.
  • Zhang et al. [2023] J. Zhang, R. Dong, and K. Ma. Clip-fo3d: Learning free open-world 3d scene representations from 2d dense clip. 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 2040–2051, 2023. URL https://api.semanticscholar.org/CorpusID:257404908.
  • Dai et al. [2017] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017.
  • Ramakrishnan et al. [2021] S. K. Ramakrishnan, A. Gokaslan, E. Wijmans, O. Maksymets, A. Clegg, J. Turner, E. Undersander, W. Galuba, A. Westbury, A. X. Chang, M. Savva, Y. Zhao, and D. Batra. Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai. ArXiv, abs/2109.08238, 2021. URL https://api.semanticscholar.org/CorpusID:237563216.
  • Qin et al. [2024] M. Qin, W. Li, J. Zhou, H. Wang, and H. Pfister. Langsplat: 3d language gaussian splatting, 2024.
  • Chen et al. [2022] M. Chen, Q. Hu, Z. Yu, H. Thomas, A. Feng, Y. Hou, K. McCullough, F. Ren, and L. Soibelman. Stpls3d: A large-scale synthetic and real aerial photogrammetry 3d point cloud dataset. arXiv preprint arXiv:2203.09065, 2022.
  • Straub et al. [2019] J. Straub, T. Whelan, L. Ma, Y. Chen, E. Wijmans, S. Green, J. J. Engel, R. Mur-Artal, C. Ren, S. Verma, et al. The replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797, 2019.
  • Liu et al. [2021] H. Liu, A. Lin, X. Han, L. Yang, Y. Yu, and S. Cui. Refer-it-in-rgbd: A bottom-up approach for 3d visual grounding in rgbd images. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6028–6037, 2021.
  • Mauceri et al. [2019] C. Mauceri, M. Palmer, and C. Heckman. Sun-spot: An rgb-d dataset with spatial referring expressions. 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 1883–1886, 2019.
  • Fang et al. [2020] H.-S. Fang, C. Wang, M. Gou, and C. Lu. Graspnet-1billion: A large-scale benchmark for general object grasping. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11444–11453, 2020.
  • Zhang et al. [2022] H. Zhang, D. Yang, H. Wang, B. Zhao, X. Lan, J. Ding, and N. Zheng. Regrad: A large-scale relational grasp dataset for safe and object-specific robotic grasping in clutter. IEEE Robotics and Automation Letters, 7(2):2929–2936, 2022.
  • Tziafas et al. [2023] G. Tziafas, X. Yucheng, A. Goel, M. Kasaei, Z. Li, and H. Kasaei. Language-guided robot grasping: Clip-based referring grasp synthesis in clutter. In 7th Annual Conference on Robot Learning, 2023.
  • Vuong et al. [2023] A. D. Vuong, M. N. Vu, H. Le, B. Huang, B. P. K. Huynh, T. D. Vo, A. Kugi, and A. Nguyen. Grasp-anything: Large-scale grasp dataset from foundation models. ArXiv, abs/2309.09818, 2023. URL https://api.semanticscholar.org/CorpusID:262045996.
  • Armeni et al. [2016] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese. 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1534–1543, 2016.
  • Rozenberszki et al. [2022] D. Rozenberszki, O. Litany, and A. Dai. Language-grounded indoor 3d semantic segmentation in the wild. In European Conference on Computer Vision, pages 125–141. Springer, 2022.
  • Eppner et al. [2020] C. Eppner, A. Mousavian, and D. Fox. ACRONYM: A large-scale grasp dataset based on simulation. In 2021 IEEE Int. Conf. on Robotics and Automation, ICRA, 2020.
  • Community [2018] B. O. Community. Blender - a 3d modelling and rendering package. 2018. URL http://www.blender.org.
  • Chang et al. [2015] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University — Princeton University — Toyota Technological Institute at Chicago, 2015.
  • GPT [2023] Gpt-4v(ision) system card. 2023. URL https://api.semanticscholar.org/CorpusID:263218031.
  • Kirillov et al. [2023] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. B. Girshick. Segment anything. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3992–4003, 2023. URL https://api.semanticscholar.org/CorpusID:257952310.
  • Gu et al. [2021] X. Gu, T.-Y. Lin, W. Kuo, and Y. Cui. Open-vocabulary object detection via vision and language knowledge distillation. In International Conference on Learning Representations, 2021. URL https://api.semanticscholar.org/CorpusID:238744187.
  • Schult et al. [2023] J. Schult, F. Engelmann, A. Hermans, O. Litany, S. Tang, and B. Leibe. Mask3D: Mask Transformer for 3D Semantic Instance Segmentation. 2023.
  • Chen et al. [2023] S. Chen, W. N. Tang, P. Xie, W. Yang, and G. Wang. Efficient heatmap-guided 6-dof grasp detection in cluttered scenes. IEEE Robotics and Automation Letters, 8:4895–4902, 2023. URL https://api.semanticscholar.org/CorpusID:259363869.
  • Koenig and Howard [2004] N. P. Koenig and A. Howard. Design and use paradigms for gazebo, an open-source multi-robot simulator. 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat. No.04CH37566), 3:2149–2154 vol.3, 2004.
  • Choy et al. [2019] C. B. Choy, J. Gwak, and S. Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3070–3079, 2019. URL https://api.semanticscholar.org/CorpusID:121123422.
  • Han et al. [2020] L. Han, T. Zheng, L. Xu, and L. Fang. Occuseg: Occupancy-aware 3d instance segmentation. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2937–2946, 2020. URL https://api.semanticscholar.org/CorpusID:212725768.
  • Hu et al. [2021a] W. Hu, H. Zhao, L. Jiang, J. Jia, and T.-T. Wong. Bidirectional projection network for cross dimension scene understanding. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14368–14377, 2021a. URL https://api.semanticscholar.org/CorpusID:232379958.
  • Hu et al. [2021b] Z. Hu, X. Bai, J. Shang, R. Zhang, J. Dong, X. Wang, G. Sun, H. Fu, and C.-L. Tai. Vmnet: Voxel-mesh network for geodesic-aware 3d semantic segmentation. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 15468–15478, 2021b. URL https://api.semanticscholar.org/CorpusID:236493200.
  • Li et al. [2022] J. Li, X. He, Y. Wen, Y. Gao, X. Cheng, and D. Zhang. Panoptic-phnet: Towards real-time and high-precision lidar panoptic segmentation via clustering pseudo heatmap. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11799–11808, 2022. URL https://api.semanticscholar.org/CorpusID:248811224.
  • Robert et al. [2022] D. Robert, B. Vallet, and L. Landrieu. Learning multi-view aggregation in the wild for large-scale 3d semantic segmentation. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5565–5574, 2022. URL https://api.semanticscholar.org/CorpusID:248218804.
  • Wu et al. [2014] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1912–1920, 2014. URL https://api.semanticscholar.org/CorpusID:206592833.
  • Zhang et al. [2021] R. Zhang, Z. Guo, W. Zhang, K. Li, X. Miao, B. Cui, Y. J. Qiao, P. Gao, and H. Li. Pointclip: Point cloud understanding by clip. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8542–8552, 2021. URL https://api.semanticscholar.org/CorpusID:244909021.
  • Caesar et al. [2019] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom. nuscenes: A multimodal dataset for autonomous driving. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11618–11628, 2019. URL https://api.semanticscholar.org/CorpusID:85517967.
  • Behley et al. [2019] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall. Semantickitti: A dataset for semantic scene understanding of lidar sequences. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9296–9306, 2019. URL https://api.semanticscholar.org/CorpusID:199441943.
  • Achlioptas et al. [2020] P. Achlioptas, A. Abdelreheem, F. Xia, M. Elhoseiny, and L. J. Guibas. Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes. In European Conference on Computer Vision, 2020.
  • Zhao et al. [2021] L. Zhao, D. Cai, L. Sheng, and D. Xu. 3dvg-transformer: Relation modeling for visual grounding on point clouds. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 2908–2917, 2021.
  • Huang et al. [2022] S. Huang, Y. Chen, J. Jia, and L. Wang. Multi-view transformer for 3d visual grounding. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15503–15512, 2022.
  • Rozenberszki et al. [2022] D. Rozenberszki, O. Litany, and A. Dai. Language-grounded indoor 3d semantic segmentation in the wild. ArXiv, abs/2204.07761, 2022. URL https://api.semanticscholar.org/CorpusID:248227627.
  • Zhong et al. [2021] Y. Zhong, J. Yang, P. Zhang, C. Li, N. C. F. Codella, L. H. Li, L. Zhou, X. Dai, L. Yuan, Y. Li, and J. Gao. Regionclip: Region-based language-image pretraining. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16772–16782, 2021. URL https://api.semanticscholar.org/CorpusID:245218534.
  • Minderer et al. [2022] M. Minderer, A. A. Gritsenko, A. Stone, M. Neumann, D. Weissenborn, A. Dosovitskiy, A. Mahendran, A. Arnab, M. Dehghani, Z. Shen, X. Wang, X. Zhai, T. Kipf, and N. Houlsby. Simple open-vocabulary object detection with vision transformers. ArXiv, abs/2205.06230, 2022. URL https://api.semanticscholar.org/CorpusID:248721818.
  • Zhou et al. [2022] X. Zhou, R. Girdhar, A. Joulin, P. Krahenbuhl, and I. Misra. Detecting twenty-thousand classes using image-level supervision. ArXiv, abs/2201.02605, 2022. URL https://api.semanticscholar.org/CorpusID:245827815.
  • Minderer et al. [2023] M. Minderer, A. A. Gritsenko, and N. Houlsby. Scaling open-vocabulary object detection. ArXiv, abs/2306.09683, 2023. URL https://api.semanticscholar.org/CorpusID:259187664.
  • Wang et al. [2021] Z. Wang, Y. Lu, Q. Li, X. Tao, Y. Guo, M. Gong, and T. Liu. Cris: Clip-driven referring image segmentation. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11676–11685, 2021. URL https://api.semanticscholar.org/CorpusID:244729320.
  • Lüddecke and Ecker [2021] T. Lüddecke and A. S. Ecker. Image segmentation using text and image prompts. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7076–7086, 2021. URL https://api.semanticscholar.org/CorpusID:247794227.
  • Shafiullah et al. [2022] N. M. M. Shafiullah, C. Paxton, L. Pinto, S. Chintala, and A. Szlam. Clip-fields: Weakly supervised semantic fields for robotic memory. ArXiv, abs/2210.05663, 2022. URL https://api.semanticscholar.org/CorpusID:252815898.
  • Bolte et al. [2023] B. Bolte, A. S. Wang, J. Yang, M. Mukadam, M. Kalakrishnan, and C. Paxton. Usa-net: Unified semantic and affordance representations for robot memory. 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1–8, 2023. URL https://api.semanticscholar.org/CorpusID:258298248.