Negative as Positive: Enhancing Out-of-distribution Generalization for Graph Contrastive Learning

Zixu Wang (ORCID 0009-0006-1327-6366), CAS Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China, wangzixu22s@ict.ac.cn
Bingbing Xu (ORCID 0000-0002-0147-2590), CAS Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, xubingbing@ict.ac.cn
Yige Yuan (ORCID 0000-0001-8856-668X), CAS Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China, yuanyige20z@ict.ac.cn
Huawei Shen (ORCID 0000-0003-2425-1499), CAS Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, shenhuawei@ict.ac.cn
Xueqi Cheng (ORCID 0000-0002-5201-8195), CAS Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, cxq@ict.ac.cn
(2024)
Abstract.

Graph contrastive learning (GCL), the dominant paradigm in graph pre-training, has yielded considerable progress. Nonetheless, its capacity for out-of-distribution (OOD) generalization has been relatively underexplored. In this work, we point out that the traditional optimization of InfoNCE in GCL restricts cross-domain pairs to be negative samples only, which inevitably enlarges the distribution gap between different domains. This violates the requirement of domain invariance under the OOD scenario and consequently impairs the model’s OOD generalization performance. To address this issue, we propose a novel strategy, “Negative as Positive”, where the most semantically similar cross-domain negative pairs are treated as positive during GCL. Our experimental results, spanning a wide array of datasets, confirm that this method substantially improves the OOD generalization performance of GCL.

Graph Representation Learning; Graph OOD Generalization; Graph Contrastive Learning
Journal year: 2024. Copyright: rights retained. Conference: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’24), July 14–18, 2024, Washington, DC, USA. DOI: 10.1145/3626772.3657927. ISBN: 979-8-4007-0431-4/24/07. CCS: Computing methodologies → Machine learning.

1. Introduction

Graph Contrastive Learning (GCL) with supervised fine-tuning has emerged as the dominant paradigm for graph pre-training, exhibiting remarkable performance across diverse downstream tasks while requiring only a limited amount of labeled data (Zhu et al., 2020, 2021; Hafidi et al., [n. d.]; Thakoor et al., 2021; Zeng and Xie, 2021; Qiu et al., 2020; Wang et al., 2021; You et al., 2021; Verma et al., 2021; Mavromatis and Karypis, 2021; Jin et al., 2021). Generally, GCL aims at training a graph encoder that maximizes the mutual information between instances with similar semantic information, which are obtained via augmentation.

Most existing works assume that the pretext graph and the downstream graph are independent and identically distributed (IID) (Zhu et al., 2020, 2021). However, the graph in the downstream task often exhibits an out-of-distribution (OOD) pattern compared to the one encountered in the pretext task (Ding et al., 2021; Wu et al., 2022b; Li et al., 2022; Zhang et al., 2021, 2022a; Miao et al., 2022; Wu et al., 2022a; Chen et al., 2023). Furthermore, we find that current methods perform worse on OOD downstream graphs than on IID ones, as shown on the left side of Fig. 1.

Figure 1. Left: Traditional GCL methods perform worse under the OOD scenario than under the IID one. Right: Pairwise domain discrepancy (PDD) grows during GCL training.

To investigate this phenomenon, we use pairwise domain discrepancy (PDD), which is widely used in prior works (Tong et al., 2023; Muandet et al., 2013; Hu et al., 2020; Li et al., 2018) to measure a model’s OOD generalization capability. PDD is the average distance between domain centers in the embedding space. As shown on the right side of Fig. 1, PDD gradually increases during GCL training, in line with the degraded performance under the OOD scenario. Through in-depth analysis (details in Sec. 3.1), we argue that the model’s reduced generalization capability stems from the traditional GCL paradigm treating cross-domain pairs solely as negative samples. Because optimizing InfoNCE (Oord et al., 2018) reduces the similarity of negative samples, domains are pushed further apart, resulting in increased PDD and poor OOD generalization performance.

Motivated by the above analysis, we propose Negative as Positive (NaP) to enhance the OOD generalization of GCL. Specifically, considering that a node’s embedding represents its semantics, NaP dynamically converts a subset of cross-domain negative samples into positive samples based on embedding similarity, and reduces the distance between positive samples. NaP can therefore narrow the distribution gap among embeddings from different domains, preserving domain-shared knowledge and enhancing OOD generalization. Extensive experiments on various datasets and tasks demonstrate the improved domain generalization capability of the proposed method compared to SOTA GCL methods.

2. Preliminaries

2.1. Task Formulation of OOD in GCL

Let $\mathcal{G}=(\mathbf{X},\mathbf{A})$ denote a graph, where $\mathbf{X}\in\mathbb{R}^{N\times F}$ denotes the nodes’ feature map, and $\mathbf{x}_{i}$ is the feature of node $v_{i}$. $\mathbf{A}\in\mathbb{R}^{N\times N}$ denotes the adjacency matrix, where $\mathbf{A}_{ij}=1$ means $v_{i}$ and $v_{j}$ are connected. As Eq. 1 shows, GCL aims at training a GNN encoder (Wu et al., 2019; Kipf and Welling, 2016; Veličković et al., 2017; Xu et al., 2018) $g_{\theta}(\mathcal{G})$ by maximizing the mutual information between instances with similar semantic information via augmentation. The augmented graph is denoted as $\mathcal{G}_{\psi}$, where $\psi$ represents one kind of augmentation method, such as those used in (Feng et al., 2020; Zhu et al., 2020, 2021; Hou et al., 2022):

(1) $$\theta^{*}=\max_{\theta}\mathcal{I}(g_{\theta}(\mathcal{G}_{\alpha}),g_{\theta}(\mathcal{G}_{\beta}))$$

The formulation of OOD in GCL is as follows: $\theta^{*}$ in Eq. 1 is optimized on data $\{(G^{i})|_{i=1}^{S}\}$ and leveraged to infer $G^{T}$, with $P(G^{T})\neq P((G^{i})|_{i=1}^{S})$, where $S$ is the number of domains in pre-training. In contrast, within IID scenarios, $P(G^{T})=P((G^{i})|_{i=1}^{S})$. Fig. 1 shows the test accuracy for OOD and IID scenarios on a representative benchmark, GOOD-Twitch, where each graph $G^{i}$ is a gamer network and different domains represent the different languages used in the network. All three GCL methods (Zhu et al., 2020; Zhang et al., 2022b; Zhu et al., 2021) exhibit significant performance degradation in the presence of OOD, emphasizing the critical importance of investigating this phenomenon.

2.2. Pairwise Domain Discrepancy

Pairwise domain discrepancy (PDD) is widely used to measure the model’s OOD generalization capability in prior works (Tong et al., 2023; Muandet et al., 2013; Hu et al., 2020; Li et al., 2018). It is the average distance among all pairs of the domains’ centers. Denote the center embedding of domain $d$ as $\bar{h^{d}}=\frac{1}{N_{d}}\sum_{i=1}^{N_{d}}\mathbf{H}_{i}^{d}$, and PDD is as follows:

(2) $$PDD=\frac{1}{\binom{P}{2}}\sum_{p,q|_{1\leq p<q\leq P}}\|\bar{h^{p}}-\bar{h^{q}}\|,$$

where $P$ denotes the number of domains, $\mathbf{H}_{i}^{d}$ denotes the embedding of the $i$-th node in domain $d$, and $N_{d}$ denotes the number of nodes in domain $d$.
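The PDD definition in Eq. 2 can be illustrated with a minimal NumPy sketch. The function name `pairwise_domain_discrepancy` and the toy data are hypothetical, and the Euclidean norm is an assumption, since the text does not state which norm is used.

```python
import numpy as np
from itertools import combinations


def pairwise_domain_discrepancy(H, domains):
    """Average distance between domain centers, following Eq. 2.

    H       : (N, D) array of node embeddings.
    domains : (N,) array with the domain index of each node.
    Assumption: the norm in Eq. 2 is taken to be Euclidean.
    """
    H = np.asarray(H, dtype=float)
    domains = np.asarray(domains)
    labels = np.unique(domains)
    if labels.size < 2:
        raise ValueError("PDD needs at least two domains")
    # Center of each domain: mean embedding of its nodes.
    centers = [H[domains == d].mean(axis=0) for d in labels]
    # Average over all unordered pairs (p, q) with p < q.
    dists = [np.linalg.norm(a - b) for a, b in combinations(centers, 2)]
    return float(np.mean(dists))


# Toy example: three domains with centers (1, 0), (0, 4) and (4, 4).
H = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 3.0], [0.0, 5.0], [4.0, 4.0]])
domains = np.array([0, 0, 1, 1, 2])
pdd = pairwise_domain_discrepancy(H, domains)
print(round(pdd, 3))  # (sqrt(17) + 5 + 4) / 3, about 4.374
```

Tracking this quantity once per training epoch is one way to reproduce a curve like the one described for the right side of Fig. 1.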

3. Proposed Method

Figure 2. Left: All CDPs are negative samples. Right: PDD decreases while more CDPs are removed.

In this section, we first show the motivation of NaP and then introduce each part of NaP in detail.

3.1. Motivation

OOD scenarios are highly prevalent in GCL. Taking one common scenario as an example: in social networks, GCL may be trained on highly influential communities but applied to low-influence users (Bi et al., 2023b). The same holds in areas such as financial risk prediction (Bi et al., 2023a) (high-market-value companies vs. medium-sized ones) and fraudulent account detection (old fraud patterns vs. new ones). Such commonality highlights the critical need to address OOD in GCL. However, as shown in Fig. 1, traditional GCL methods perform poorly in OOD scenarios, and the PDD across all domains continues to increase during GCL training. The increasing PDD indicates that GCL widens the gap between domain distributions and pushes domains further apart. This runs counter to ideal OOD generalization, which should capture the knowledge shared among different domains and facilitate seamless transfer to unseen target domains.

Let a Cross-Domain Pair (CDP) denote two nodes from different domains. We argue that CDPs are the principal constituents of the negative samples used when optimizing Eq. 1, and that this is a significant factor in the poor OOD generalization capability. Specifically, as shown on the left side of Fig. 2, CDPs can only be negative samples, and the traditional contrastive loss decreases the similarity of negative samples, which pushes the two nodes of each CDP apart. Furthermore, as shown on the right side of Fig. 2, the PDD of the node embeddings learned by GCL decreases as the ratio of removed CDPs increases, which indicates that CDPs are harmful to GCL’s OOD generalization. Therefore, the CDPs in traditional GCL tend to push the representations of samples from different domains apart, resulting in a higher PDD and poor OOD generalization ability.
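A short derivation, offered as a sketch rather than as part of the original analysis, makes the pushing-apart effect concrete. Write $s_{ij}=\theta(v_{\alpha i},v_{\beta j})$ for the similarity of a pair and let $Z_{i}$ be the denominator of the InfoNCE loss $\mathcal{L}_{w}$ in Eq. 4 (Sec. 3.2.2), with temperature $\tau>0$; the shorthand $s_{ij}$ and $Z_{i}$ is introduced here for illustration only. Since $\mathcal{L}_{w}=-s_{ii}/\tau+\log Z_{i}$, the derivatives with respect to a negative pair and the positive pair are

$$\frac{\partial\mathcal{L}_{w}}{\partial s_{ij}}=\frac{1}{\tau}\,\frac{\exp(s_{ij}/\tau)}{Z_{i}}>0\quad(j\neq i),\qquad \frac{\partial\mathcal{L}_{w}}{\partial s_{ii}}=-\frac{1}{\tau}\left(1-\frac{\exp(s_{ii}/\tau)}{Z_{i}}\right)<0.$$

Gradient descent therefore lowers the similarity of every negative pair and raises that of the positive pair. Because every CDP is a negative pair in traditional GCL, each update step pushes cross-domain nodes apart, and the push is strongest for the negatives that are currently most similar to the anchor.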

3.2. NaP: Negative as Positive

Based on the above motivation, we propose NaP, which converts a subset of the most semantically similar negative samples into positive ones. Fig. 3 illustrates the overall framework of NaP, including the encoding module and the objective module. Note that the NaP framework can be adapted to existing GCL methods that use InfoNCE as the loss function, e.g., GRACE (Zhu et al., 2020), GCA (Zhu et al., 2021), and GraphCL (Hafidi et al., [n. d.]).

Figure 3. The overall framework of NaP consists of two modules: the encoding module and the objective module. The objective module comprises two stages: the warm-up stage and the NaP stage.

3.2.1. Encoding Module

The objective of this module is to obtain the embedding of each node. We first generate different views of $\mathcal{G}$, denoted $\tilde{\mathcal{G}_{\alpha}}$ and $\tilde{\mathcal{G}_{\beta}}$, using graph augmentations. We then input the augmented graphs into a shared GCN (Kipf and Welling, 2016) encoder to get the embeddings $\mathbf{H}_{\alpha}$ and $\mathbf{H}_{\beta}$. The propagation of the $l$-th layer of GCN is represented as:

(3) $$\mathbf{H}^{l+1}=\sigma(\tilde{\mathbf{D}}^{-\frac{1}{2}}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-\frac{1}{2}}\mathbf{H}^{l}\mathbf{W}^{l}),$$

where $\sigma(\cdot)$ is the activation function, $\tilde{\mathbf{A}}$ is the adjacency matrix with self-loops, $\tilde{\mathbf{D}}$ is the corresponding degree matrix, and $\mathbf{W}$ is the parameter matrix.
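The propagation rule in Eq. 3 can be sketched for a dense adjacency matrix as follows. This is an illustrative NumPy version under stated assumptions: the function name `gcn_layer` is hypothetical, and ReLU is assumed for $\sigma$ because the text does not name the activation.

```python
import numpy as np


def gcn_layer(A, H, W):
    """One GCN propagation step, following Eq. 3, on a dense graph.

    A : (N, N) adjacency matrix without self-loops.
    H : (N, F_in) input node representations.
    W : (F_in, F_out) parameter matrix.
    Assumption: sigma is ReLU; the paper only says "activation function".
    """
    A = np.asarray(A, dtype=float)
    A_tilde = A + np.eye(A.shape[0])           # adjacency with self-loops
    deg = A_tilde.sum(axis=1)                  # diagonal of D-tilde, all >= 1
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # D^{-1/2} A~ D^{-1/2} computed by row and column scaling.
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    return np.maximum(A_hat @ H @ W, 0.0)      # ReLU


# Two connected nodes with identity features and weights.
A = np.array([[0.0, 1.0], [1.0, 0.0]])
out = gcn_layer(A, np.eye(2), np.eye(2))
print(out)  # every entry is 0.5
```

In the encoding module, the same weights would be applied to both augmented views to produce $\mathbf{H}_{\alpha}$ and $\mathbf{H}_{\beta}$.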

3.2.2. Objective Module

The representations obtained from a randomly initialized model may not accurately reflect the semantic information of the samples, so we first train the GCL model in the traditional way for several epochs. This module therefore has two stages: the warm-up stage and the NaP stage.

(1) Warm-Up Stage:

Firstly, we use the traditional InfoNCE loss to train the GCL model as the warm-up for the NaP stage. The InfoNCE loss for each positive pair $(v_{\alpha i},v_{\beta i})$ in the warm-up stage is:

(4) $$\mathcal{L}_{w}=-\log\frac{\exp(\frac{\theta(v_{\alpha i},v_{\beta i})}{\tau})}{\exp(\frac{\theta(v_{\alpha i},v_{\beta i})}{\tau})+\sum_{j\neq i}\exp(\frac{\theta(v_{\alpha i},v_{\beta j})}{\tau})+\sum_{j\neq i}\exp(\frac{\theta(v_{\alpha i},v_{\alpha j})}{\tau})}$$

Here $\theta(v_{\alpha i},v_{\beta j})$ denotes the cosine similarity between $\mathbf{H}_{\alpha i}$ and $\mathbf{H}_{\beta j}$, and $\tau$ is the temperature hyperparameter.
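As a rough illustration of Eq. 4, the per-node warm-up loss can be computed with NumPy as below. The helper names and the default `tau=0.5` are hypothetical choices, not values from the paper, and the sketch assumes no embedding row is all zeros.

```python
import numpy as np


def cosine_similarity_matrix(X, Y):
    """Pairwise cosine similarity between the rows of X and the rows of Y."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Xn @ Yn.T


def warmup_infonce(H_a, H_b, tau=0.5):
    """Per-node InfoNCE loss of Eq. 4, with view alpha as the anchor view."""
    between = np.exp(cosine_similarity_matrix(H_a, H_b) / tau)  # alpha vs beta
    within = np.exp(cosine_similarity_matrix(H_a, H_a) / tau)   # alpha vs alpha
    pos = np.diag(between)                                      # pairs (i, i)
    # Positive term + between-view negatives + within-view negatives (j != i).
    denom = between.sum(axis=1) + within.sum(axis=1) - np.diag(within)
    return -np.log(pos / denom)


# Two orthogonal nodes, identical views, tau = 1.
H = np.eye(2)
losses = warmup_infonce(H, H, tau=1.0)
print(losses)  # both entries equal log(1 + 2/e)
```

Averaging the returned vector over nodes gives a scalar training objective; whether the original implementation also symmetrizes over the two views is not stated in this section.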

(2) NaP Stage:

After $n$ warm-up epochs, we enter the NaP stage, where a subset of CDPs is converted into positive samples to mitigate the domain discrepancy introduced by CDPs. We select the most similar CDPs based on the between-view embedding similarity in the current epoch and turn the chosen CDPs into positive samples by adding a new loss term. Firstly, we compute the between-view similarity matrix:

(5) $$\mathbf{B}=\mathbf{H}_{\alpha}\mathbf{H}_{\beta}^{T}$$

We focus our attention on cross-domain samples, so we update $\mathbf{B}$ as follows:

(6) $$\mathbf{B}_{ij}=0\text{ if }d_{i}=d_{j}$$

Here $d_{i}$ denotes the domain index of $v_{i}$, $i\in\{1,2,\dots,N\}$. After sorting the elements in $\mathbf{B}$, we can select the top $r$ most similar samples and their indices $\mathrm{idx}$ as follows:

(7) $$\mathrm{idx}=\arg\max_{I\subset\mathbb{R}^{N\times N}:|I|=r}\sum_{(i,j)\in I}\mathbf{B}_{ij}$$

To obtain the transformed CDPs, we set the mask matrix:

(8) $$mask_{ij}=1\text{ if }(i,j)\in\mathrm{idx}\text{ else }0$$
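The selection steps in Eqs. 5 to 8 can be sketched as follows. The function name `select_top_r_cdps` is hypothetical, and $r$ is read as a count of pairs, following $|I|=r$ in Eq. 7. One deliberate deviation is labeled in the code: selection is restricted to cross-domain positions, so that a zeroed same-domain entry can never be picked when all cross-domain similarities are negative.

```python
import numpy as np


def select_top_r_cdps(H_a, H_b, domains, r):
    """Build the between-view similarity matrix and the CDP mask (Eqs. 5-8)."""
    H_a = np.asarray(H_a, dtype=float)
    H_b = np.asarray(H_b, dtype=float)
    domains = np.asarray(domains)

    B = H_a @ H_b.T                                    # Eq. 5
    same_domain = domains[:, None] == domains[None, :]
    B = np.where(same_domain, 0.0, B)                  # Eq. 6

    # Assumption (not in the paper): same-domain entries are excluded from
    # the ranking outright instead of competing with the value 0.
    scores = np.where(same_domain, -np.inf, B)
    r = min(int(r), int((~same_domain).sum()))

    mask = np.zeros(B.shape, dtype=int)                # Eq. 8
    if r > 0:
        top = np.argsort(scores, axis=None)[::-1][:r]  # Eq. 7: r largest entries
        rows, cols = np.unravel_index(top, B.shape)
        mask[rows, cols] = 1
    return B, mask


# Nodes 0, 1 belong to domain 0 and nodes 2, 3 to domain 1.
H = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.1]])
B, mask = select_top_r_cdps(H, H, domains=[0, 0, 1, 1], r=2)
print(np.argwhere(mask == 1))  # the pairs (0, 3) and (3, 0)
```

Sorting all $N^2$ entries is only practical for small graphs; for larger ones, `np.argpartition` or mini-batching would be a natural substitute, though the section does not say how the original implementation handles scale.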

Up to this point, only the top $r$ most similar CDPs are retained in the mask. We add a new loss item to transform these CDPs into positive samples, namely $\mathcal{L}_{NaP}$:

(9) $$\mathcal{L}_{NaP}=-\log\frac{\sum_{j\neq i}mask_{ij}\{\exp(\frac{\theta(v_{\alpha i},v_{\beta j})}{\tau})+\exp(\frac{\theta(v_{\alpha i},v_{\alpha j})}{\tau})\}}{\exp(\frac{\theta(v_{\alpha i},v_{\beta i})}{\tau})+\sum_{j\neq i}\exp(\frac{\theta(v_{\alpha i},v_{\beta j})}{\tau})+\sum_{j\neq i}\exp(\frac{\theta(v_{\alpha i},v_{\alpha j})}{\tau})}$$

Finally, for each positive pair $(v_{\alpha i},v_{\beta i})$, the loss in the NaP stage is written as below:

(10) $$\begin{aligned}\mathcal{L}&=\mathcal{L}_{NaP}+\mathcal{L}_{w}\\ &=-\log\frac{\exp(\frac{\theta(v_{\alpha i},v_{\beta i})}{\tau})+\sum_{j\neq i}mask_{ij}\{\exp(\frac{\theta(v_{\alpha i},v_{\beta j})}{\tau})+\exp(\frac{\theta(v_{\alpha i},v_{\alpha j})}{\tau})\}}{\exp(\frac{\theta(v_{\alpha i},v_{\beta i})}{\tau})+\sum_{j\neq i}\exp(\frac{\theta(v_{\alpha i},v_{\beta j})}{\tau})+\sum_{j\neq i}\exp(\frac{\theta(v_{\alpha i},v_{\alpha j})}{\tau})}\end{aligned}$$

To sum up, after n epochs of training according to the loss in Eq. 4, NaP selects the top r most similar CDPs based on the current epoch’s embedding similarity. These CDPs are then treated as positive samples, and the training continues using the loss described in Eq. 10.
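The NaP-stage objective can be sketched in NumPy as below. The code implements the combined closed-form expression on the second line of Eq. 10, in which masked CDP terms are added to the numerator of the InfoNCE ratio. The function names and `tau` default are hypothetical, and `mask` is assumed to be an $N\times N$ 0/1 array such as the one defined in Eq. 8.

```python
import numpy as np


def _exp_cosine(X, Y, tau):
    """exp(cosine similarity / tau) between the rows of X and the rows of Y."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return np.exp(Xn @ Yn.T / tau)


def nap_stage_loss(H_a, H_b, mask, tau=0.5):
    """Per-node loss of the second line of Eq. 10 (view alpha as anchor)."""
    between = _exp_cosine(H_a, H_b, tau)
    within = _exp_cosine(H_a, H_a, tau)
    off_diag = 1.0 - np.eye(H_a.shape[0])      # enforces j != i
    pos = np.diag(between)
    # CDPs promoted to positives contribute both their between-view and
    # within-view terms to the numerator.
    promoted = (mask * off_diag * (between + within)).sum(axis=1)
    denom = pos + (off_diag * between).sum(axis=1) + (off_diag * within).sum(axis=1)
    return -np.log((pos + promoted) / denom)


H = np.eye(2)
no_mask = np.zeros((2, 2))
full_mask = np.array([[0.0, 1.0], [1.0, 0.0]])
loss_warmup = nap_stage_loss(H, H, no_mask, tau=1.0)   # reduces to Eq. 4
loss_nap = nap_stage_loss(H, H, full_mask, tau=1.0)    # every negative promoted
print(loss_warmup, loss_nap)
```

With an all-zero mask the expression reduces to the warm-up loss of Eq. 4, so a training loop could pass a zero mask for the first $n$ epochs and recompute the mask from the current embeddings in each later epoch.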

4. Experiments

In this section, we empirically evaluate the quality of the produced node embeddings on node classification using two public benchmarks: the GOOD benchmark and Facebook100.

4.1. Datasets

We use 3 datasets from the GOOD benchmark (Gui et al., 2022) and 15 datasets from Facebook100 (Traud et al., 2012) for experiments. Datasets from Facebook100 are social networks of 100 universities in the US. Each university is viewed as a domain, and each node stands for a student or faculty member.

Table 1. Experimental results of all baselines and NaP. The bold font represents the top-1 performance and the underline represents the second-best performance across the self-supervised methods.

Benchmark  Facebook100 Facebook100 Facebook100 Facebook100 Facebook100 GOOD        GOOD        GOOD
Dataset    Santa       Wake        Bucknell    Colgate     Wesleyan    Twitch      CBAS        Cora
Domain     university  university  university  university  university  language    color       degree
DGI        87.08%      83.02%      89.24%      89.55%      88.52%      53.34%      52.86%      46.61%
GRACE      87.88%      82.70%      90.12%      82.09%      90.80%      58.00%      48.10%      50.85%
GCA        89.10%      82.71%      93.01%      91.18%      90.11%      60.14%      50.00%      50.97%
COSTA      89.93%      75.29%      91.46%      91.52%      88.36%      49.40%      45.24%      48.09%
BGRL       88.80%      83.61%      91.59%      85.18%      82.41%      63.25%      49.05%      40.63%
MVGRL      90.12%      78.58%      91.45%      88.38%      90.13%      53.98%      50.95%      47.15%
Ours       91.06%      86.55%      93.26%      93.18%      91.51%      61.08%      53.33%      51.31%
improve    +0.94%      +2.94%      +0.25%      +1.66%      +0.71%      -2.17%      +0.47%      +0.34%
GCN        92.10%      87.14%      94.47%      93.24%      92.10%      51.65%      65.24%      59.39%

4.2. Experimental Setup

4.2.1. Data settings

We split the datasets following GOOD (Gui et al., 2022). Specifically, for Facebook100, we randomly select 9 domains as the source domains for training, use 1 domain (Emory) for validation, and use 15 others for testing.

4.2.2. Model and Metric settings

We use 6 contrastive methods, DGI, GRACE, GCA, COSTA, BGRL, and MVGRL (Veličković et al., 2018; Zhu et al., 2020, 2021; Thakoor et al., 2021; Zhang et al., 2022b; Hassani and Khasahmadi, 2020), as self-supervised baselines, and GCN (Kipf and Welling, 2016) as the supervised baseline. The checkpoint for OOD testing is selected based on the results on the OOD validation domain. The reported results are the average accuracy over three independent runs.

4.2.3. Results and Analysis

(1) NaP surpasses baselines

As shown in Table 1, NaP outperforms almost all GCL baselines. Notably, NaP surpasses all four baselines that use the InfoNCE loss, DGI, GRACE, GCA, and COSTA (Veličković et al., 2018; Zhu et al., 2020, 2021; Zhang et al., 2022b), with an improvement of up to 11.68% (over COSTA on Twitch). Furthermore, NaP outperforms BGRL, which uses BYOL (Thakoor et al., 2021) as the loss function, and MVGRL, which uses JSD (Hassani and Khasahmadi, 2020) as the loss function, on the majority of datasets. Finally, NaP performs competitively with the supervised GCN (Kipf and Welling, 2016) even though it uses significantly fewer labels.

Figure 4. Experimental results of NaP and GRACE on 10 OOD target domains from Facebook100.
(2) NaP’s strategy is highly effective.

As shown in Fig. 4, NaP achieves higher accuracy than GRACE on 10 additional domains. Since this experiment uses GRACE for the warm-up stage, NaP’s superior OOD generalization ability demonstrates the effectiveness of the proposed strategy.

Figure 5. t-SNE visualization and PDD of node embedding.
(3) NaP narrows the distance between domains

As shown in Fig. 5, compared to GRACE, the embedding obtained by NaP exhibits a smaller PDD. More importantly, as the PDD decreases, the node distributions between different domains with the same label become closer.

Table 2. The similarity comparison of different CDPs.
                  Input Feature  Embedding
All CDPs          0.0015         0.0199
Transformed CDPs  0.0282         0.8523
Other CDPs        -0.0010        -0.0891
(4) The CDPs transformed by NaP exhibit semantic similarity in the input space.

As shown in Table 2, the cosine similarity of the transformed CDPs is significantly higher than that of all CDPs and of the remaining CDPs, both in the input feature space and in the embedding space. This demonstrates that NaP indeed transforms the most semantically similar CDPs into positive samples.
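A comparison of this kind can be sketched as follows. The function name `mean_cdp_similarity` and the toy data are hypothetical, and the exact averaging protocol behind Table 2 is not described in this section, so the sketch simply averages cosine similarity over ordered cross-domain node pairs within a single representation `X` (input features or embeddings).

```python
import numpy as np


def mean_cdp_similarity(X, domains, mask):
    """Mean cosine similarity of all, transformed, and other CDPs."""
    X = np.asarray(X, dtype=float)
    domains = np.asarray(domains)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T                                      # cosine similarity
    cross = domains[:, None] != domains[None, :]       # CDP positions
    transformed = cross & np.asarray(mask).astype(bool)
    other = cross & ~np.asarray(mask).astype(bool)

    def avg(selection):
        return float(S[selection].mean()) if selection.any() else float("nan")

    return {"all": avg(cross), "transformed": avg(transformed), "other": avg(other)}


# Nodes 0, 1 are in domain 0 and nodes 2, 3 in domain 1; the mask marks
# the CDPs (0, 2) and (1, 3) in both directions.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
mask = np.zeros((4, 4), dtype=int)
for i, j in [(0, 2), (2, 0), (1, 3), (3, 1)]:
    mask[i, j] = 1
stats = mean_cdp_similarity(X, [0, 0, 1, 1], mask)
print(stats)  # {'all': 0.5, 'transformed': 1.0, 'other': 0.0}
```

Calling the function once with input features and once with learned embeddings would yield the two columns of a table laid out like Table 2.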

5. Conclusion

In this work, we investigate the OOD generalization capability of traditional graph contrastive learning methods. We argue that cross-domain pairs (CDPs) enlarge the distribution shift between domains and hinder the model’s OOD generalization capability. Based on this, we propose to convert the most semantically similar CDPs into positive samples. Comprehensive experiments show that our method, NaP, significantly benefits the OOD generalization capability of graph contrastive learning methods.

6. Acknowledgement

This work was supported by the National Natural Science Foundation of China (Grant No. U21B2046, No. 62202448) and the Strategic Priority Research Program of the CAS (Grant No. XDB0680302).

References

  • Bi et al. (2023a) Wendong Bi, Xueqi Cheng, Bingbing Xu, Xiaoqian Sun, Li Xu, and Huawei Shen. 2023a. Bridged-gnn: Knowledge bridge learning for effective knowledge transfer. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 99–109.
  • Bi et al. (2023b) Wendong Bi, Bingbing Xu, Xiaoqian Sun, Li Xu, Huawei Shen, and Xueqi Cheng. 2023b. Predicting the silent majority on graphs: Knowledge transferable graph neural network. In Proceedings of the ACM Web Conference 2023. 274–285.
  • Chen et al. (2023) Guoxin Chen, Yongqing Wang, Fangda Guo, Qinglang Guo, Jiangli Shao, Huawei Shen, and Xueqi Cheng. 2023. Causality and independence enhancement for biased node classification. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 203–212.
  • Ding et al. (2021) Mucong Ding, Kezhi Kong, Jiuhai Chen, John Kirchenbauer, Micah Goldblum, David Wipf, Furong Huang, and Tom Goldstein. 2021. A closer look at distribution shifts and out-of-distribution generalization on graphs. (2021).
  • Feng et al. (2020) Wenzheng Feng, Jie Zhang, Yuxiao Dong, Yu Han, Huanbo Luan, Qian Xu, Qiang Yang, Evgeny Kharlamov, and Jie Tang. 2020. Graph random neural networks for semi-supervised learning on graphs. Advances in neural information processing systems 33 (2020), 22092–22103.
  • Gui et al. (2022) Shurui Gui, Xiner Li, Limei Wang, and Shuiwang Ji. 2022. Good: A graph out-of-distribution benchmark. Advances in Neural Information Processing Systems 35 (2022), 2059–2073.
  • Hafidi et al. ([n. d.]) H Hafidi, M Ghogho, P Ciblat, and A Swami. [n. d.]. Graphcl: Contrastive self-supervised learning of graph representations. arXiv preprint arXiv:2007.08025 (2020).
  • Hassani and Khasahmadi (2020) Kaveh Hassani and Amir Hosein Khasahmadi. 2020. Contrastive multi-view representation learning on graphs. In International conference on machine learning. PMLR, 4116–4126.
  • Hou et al. (2022) Zhenyu Hou, Xiao Liu, Yukuo Cen, Yuxiao Dong, Hongxia Yang, Chunjie Wang, and Jie Tang. 2022. Graphmae: Self-supervised masked graph autoencoders. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 594–604.
  • Hu et al. (2020) Shoubo Hu, Kun Zhang, Zhitang Chen, and Laiwan Chan. 2020. Domain generalization via multidomain discriminant analysis. In Uncertainty in Artificial Intelligence. PMLR, 292–302.
  • Jin et al. (2021) Ming Jin, Yizhen Zheng, Yuan-Fang Li, Chen Gong, Chuan Zhou, and Shirui Pan. 2021. Multi-scale contrastive siamese networks for self-supervised graph representation learning. arXiv preprint arXiv:2105.05682 (2021).
  • Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
  • Li et al. (2022) Haoyang Li, Ziwei Zhang, Xin Wang, and Wenwu Zhu. 2022. Learning invariant graph representations for out-of-distribution generalization. Advances in Neural Information Processing Systems 35 (2022), 11828–11841.
  • Li et al. (2018) Ya Li, Mingming Gong, Xinmei Tian, Tongliang Liu, and Dacheng Tao. 2018. Domain generalization via conditional invariant representations. In Proceedings of the AAAI conference on artificial intelligence, Vol. 32.
  • Mavromatis and Karypis (2021) Costas Mavromatis and George Karypis. 2021. Graph infoclust: Maximizing coarse-grain mutual information in graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 541–553.
  • Miao et al. (2022) Siqi Miao, Mia Liu, and Pan Li. 2022. Interpretable and generalizable graph learning via stochastic attention mechanism. In International Conference on Machine Learning. PMLR, 15524–15543.
  • Muandet et al. (2013) Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. 2013. Domain generalization via invariant feature representation. In International conference on machine learning. PMLR, 10–18.
  • Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).
  • Qiu et al. (2020) Jiezhong Qiu, Qibin Chen, Yuxiao Dong, Jing Zhang, Hongxia Yang, Ming Ding, Kuansan Wang, and Jie Tang. 2020. Gcc: Graph contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. 1150–1160.
  • Thakoor et al. (2021) Shantanu Thakoor, Corentin Tallec, Mohammad Gheshlaghi Azar, Mehdi Azabou, Eva L Dyer, Remi Munos, Petar Veličković, and Michal Valko. 2021. Large-scale representation learning on graphs via bootstrapping. arXiv preprint arXiv:2102.06514 (2021).
  • Tong et al. (2023) Peifeng Tong, Wu Su, He Li, Jialin Ding, Zhan Haoxiang, and Song Xi Chen. 2023. Distribution free domain generalization. In International Conference on Machine Learning. PMLR, 34369–34378.
  • Traud et al. (2012) Amanda L Traud, Peter J Mucha, and Mason A Porter. 2012. Social structure of facebook networks. Physica A: Statistical Mechanics and its Applications 391, 16 (2012), 4165–4180.
  • Veličković et al. (2017) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
  • Veličković et al. (2018) Petar Veličković, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. 2018. Deep graph infomax. arXiv preprint arXiv:1809.10341 (2018).
  • Verma et al. (2021) Vikas Verma, Thang Luong, Kenji Kawaguchi, Hieu Pham, and Quoc Le. 2021. Towards domain-agnostic contrastive learning. In International Conference on Machine Learning. PMLR, 10530–10541.
  • Wang et al. (2021) Xiao Wang, Nian Liu, Hui Han, and Chuan Shi. 2021. Self-supervised heterogeneous graph neural network with co-contrastive learning. In Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining. 1726–1736.
  • Wu et al. (2019) Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. 2019. Simplifying graph convolutional networks. In International conference on machine learning. PMLR, 6861–6871.
  • Wu et al. (2022b) Qitian Wu, Hengrui Zhang, Junchi Yan, and David Wipf. 2022b. Handling distribution shifts on graphs: An invariance perspective. arXiv preprint arXiv:2202.02466 (2022).
  • Wu et al. (2022a) Ying-Xin Wu, Xiang Wang, An Zhang, Xiangnan He, and Tat-Seng Chua. 2022a. Discovering invariant rationales for graph neural networks. arXiv preprint arXiv:2201.12872 (2022).
  • Xu et al. (2018) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2018. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826 (2018).
  • You et al. (2021) Yuning You, Tianlong Chen, Yang Shen, and Zhangyang Wang. 2021. Graph contrastive learning automated. In International Conference on Machine Learning. PMLR, 12121–12132.
  • Zeng and Xie (2021) Jiaqi Zeng and Pengtao Xie. 2021. Contrastive self-supervised learning for graph classification. In Proceedings of the AAAI conference on Artificial Intelligence, Vol. 35. 10824–10832.
  • Zhang et al. (2021) Shengyu Zhang, Kun Kuang, Jiezhong Qiu, Jin Yu, Zhou Zhao, Hongxia Yang, Zhongfei Zhang, and Fei Wu. 2021. Stable prediction on graphs with agnostic distribution shift. arXiv preprint arXiv:2110.03865 (2021).
  • Zhang et al. (2022b) Yifei Zhang, Hao Zhu, Zixing Song, Piotr Koniusz, and Irwin King. 2022b. COSTA: covariance-preserving feature augmentation for graph contrastive learning. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2524–2534.
  • Zhang et al. (2022a) Zeyang Zhang, Xin Wang, Ziwei Zhang, Haoyang Li, Zhou Qin, and Wenwu Zhu. 2022a. Dynamic graph neural networks under spatio-temporal distribution shift. Advances in Neural Information Processing Systems 35 (2022), 6074–6089.
  • Zhu et al. (2020) Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. 2020. Deep graph contrastive representation learning. arXiv preprint arXiv:2006.04131 (2020).
  • Zhu et al. (2021) Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. 2021. Graph contrastive learning with adaptive augmentation. In Proceedings of the Web Conference 2021. 2069–2080.