Negative as Positive: Enhancing Out-of-distribution Generalization for Graph Contrastive Learning
Abstract.
Graph contrastive learning (GCL), standing as the dominant paradigm in the realm of graph pre-training, has yielded considerable progress. Nonetheless, its capacity for out-of-distribution (OOD) generalization has been relatively underexplored. In this work, we point out that the traditional optimization of InfoNCE in GCL restricts the cross-domain pairs only to be negative samples, which inevitably enlarges the distribution gap between different domains. This violates the requirement of domain invariance under OOD scenario and consequently impairs the model’s OOD generalization performance. To address this issue, we propose a novel strategy “Negative as Positive”, where the most semantically similar cross-domain negative pairs are treated as positive during GCL. Our experimental results, spanning a wide array of datasets, confirm that this method substantially improves the OOD generalization performance of GCL.
1. Introduction
Graph Contrastive Learning (GCL) with supervised fine-tuning has emerged as the dominant paradigm for graph pre-training, exhibiting remarkable performance across diverse downstream tasks while requiring only a limited amount of labeled data(Zhu et al., 2020, 2021; Hafidi et al., [n. d.]; Thakoor et al., 2021; Zeng and Xie, 2021; Qiu et al., 2020; Wang et al., 2021; You et al., 2021; Verma et al., 2021; Mavromatis and Karypis, 2021; Jin et al., 2021). Generally, GCL aims at training a graph encoder that maximizes the mutual information between instances with similar semantic information via augmentation.
Most existing works assume the pre-text graph and downstream graph are independent and identically distributed (IID)(Zhu et al., 2020, 2021). However, the graph in the downstream task often exhibits an out-of-distribution (OOD) pattern compared to that encountered in pre-text task(Ding et al., 2021; Wu et al., 2022b; Li et al., 2022; Zhang et al., 2021, 2022a; Miao et al., 2022; Wu et al., 2022a; Chen et al., 2023). Furthermore, we find that current methods perform poorly on the OOD downstream graph than IID ones, as shown on the left side of Fig. 1.
To delve into the phenomenon mentioned above, we utilize pairwise domain discrepancy (PDD), which is widely used in prior works(Tong et al., 2023; Muandet et al., 2013; Hu et al., 2020; Li et al., 2018) to measure the model’s OOD generalization capability. PDD describes the average distance between domain centers in the embedding space. As shown on the right side of Fig. 1, PDD gradually increases during GCL training, aligning with the declined performance under the OOD scenario. Through in-depth analysis (details in Sec. 3.1), we argue that the model’s reduced generalization capability stems from treating cross-domain pair as a negative sample solely in the traditional GCL paradigm. By aiming to reduce negative sample similarity in InfoNCE(Oord et al., 2018), domains are pushed further apart, resulting in increased PDD and poor OOD generalization performance.
Motivated by the above analysis, we propose Negative as Positive, namely NaP, to enhance the OOD generalization of GCL. Specifically, considering that the embedding of nodes represents its semantics, NaP dynamically transfers a subset of cross-domain negative samples as positive samples based on the embedding similarity, and reduces the distance of positive samples. Therefore, NaP can narrow the distribution gap among embedding from different domains, further preserving domain-shared knowledge and enhancing OOD generalization. Extensive experiments on various datasets and tasks demonstrate the improved domain generalization capability of the proposed method compared to the SOTA GCL methods.
2. Preliminaries
2.1. Task Formulation of OOD in GCL
Let denote a graph, where denotes the nodes’ feature map, and is the feature of node . denotes the adjacency matrix, where means and are connected. As Eq. 1 shows, GCL aims at training a GNN encoder(Wu et al., 2019; Kipf and Welling, 2016; Veličković et al., 2017; Xu et al., 2018) by maximizing the mutual information between instances with similar semantic information via augmentation. The augmented graph is noted as , where represents one kind of augmentation method such as used in (Feng et al., 2020; Zhu et al., 2020, 2021; Hou et al., 2022),
(1) |
The formulation of OOD in GCL is as follows: in Eq. 1 is optimized on data , and leveraged to infer , with , where is the number of domains in pre-training. In contrast, within IID scenarios, . Fig.1 shows the test accuracy for OOD and IID scenarios of a representative benchmark GOOD-Twitch, where each graph is a gamer network and different domains represent the different languages used in the network. All three GCL methods(Zhu et al., 2020; Zhang et al., 2022b; Zhu et al., 2021) exhibit significant performance degradation in the presence of OOD, emphasizing the critical importance of investigating this phenomenon.
2.2. Pairwise Domain Discrepancy
Pairwise domain discrepancy(PDD) is widely used to measure the model’s OOD generalization capability in prior works(Tong et al., 2023; Muandet et al., 2013; Hu et al., 2020; Li et al., 2018). It’s the average distance among all pairs of the domains’ centers. Denote the center embedding of domain as , and PDD is as follows:
(2) |
where P denotes the number of domains, denotes the embedding of -th node in domain and denotes the number of nodes in domain .
3. Proposed Method
In this section, we first show the motivation of NaP and then introduce each part of NaP in detail.
3.1. Motivation
The phenomenon of OOD is highly prevalent in GCL, which underscores the need to address OOD issues. Taking one common scenario as an example: in social networks, GCL may be trained on highly influential communities but applied to low-influence users (Bi et al., 2023b). This phenomenon is also common in areas such as financial risk prediction(Bi et al., 2023a) (high-market-value companies VS medium-sized ones) and fraudulent accounts detection (old fraudulent style VS new ones). Such commonality highlights the critical need to address OOD in GCL. However, as shown in Fig. 1, the traditional GCLs perform poorly on OOD scenarios, and the PDD of all domains continues to increase during the training of GCLs. The increasing PDD indicates that GCL will widen the gap in domain distribution and push domains further apart, violating an ideal OOD generalization, which should capture the shared knowledge among different domains and facilitate the seamless transfer to unseen target domains.
Let Cross-Domain Pair (CDP) represent two nodes from different domains. We argue that the principal constituents of negative samples for optimizing Eq. 1 are CDPs, being a significant factor in the poor OOD generalization capability. Specifically, as shown on the left side of Fig.2, CDPs can only be negative samples, and the traditional contrastive loss will decrease the similarity of negative samples, leading to the pushing-apart effect between the nodes in CDP. Furthermore, as shown on the right side of Fig. 2, the PDD of node embedding of GCL decreases as the ratio of removed CDP increases which proves that CDPs are harmful to GCL’s OOD generalization. Therefore, the CDPs in traditional GCL tend to push the representations of samples from different domains apart, resulting in a higher PDD and a poor OOD generalization ability.
3.2. NaP: Negative as Positive
Based on the above motivation, we propose NaP, which transfers a subset of the most semantically similar negative samples as positive ones. Fig.3 illustrates the overall framework of NaP, including the encoding module and the objective module. Note that our NaP framework can be adapted to existing GCL methods that use InfoNCE as loss function, e.g., GRACE(Zhu et al., 2020), GCA(Zhu et al., 2021), and GraphCL(Hafidi et al., [n. d.]).
3.2.1. Encoding Module
The objective of this module is to obtain the embedding of each node. We first generate different views of as , using graph augmentations. And input the augmented graphs into a shared GCN(Kipf and Welling, 2016) encoder to get the embedding , . The propagation of the -th layer of GCN is represented as:
(3) |
where is the activation function, is the adjacency matrix with self-loop, is the corresponding degree matrix and W is the parameter matrix.
3.2.2. Objective Module
Considering that the representations obtained from randomly initialized models may not accurately reflect the semantic information of the samples, we have to train the GCL in the traditional way for several epochs. Therefore, there are two stages in this module: Warm-up stage and NaP stage.
(1) Warm-Up Stage:
Firstly, we use the traditional InfoNCE loss to train the GCL as the warm-up for the NaP stage. The InfoNCE loss for each positive pair in warm-up stage is:
(4) |
The means cosine similarity between , .
(2) NaP Stage:
After n epochs warm-up, we enter the NaP stage where a subset of CDPs is chosen to transform into positive samples to mitigate the domain discrepancies introduced by CDPs. We select the most similar CDPs based on the between-view embedding similarity in the current epoch and transform the chosen CDPs into positive samples by adding a new loss item. Firstly, we compute the between-view-similarity matrix:
(5) |
We focus our attention on cross-domain samples, so we update as follows:
(6) |
The means the domain index of . After sorting the elements in , we can select the top of most similar samples and their indices as follows:
(7) |
To obtain the transformed CDPs, we set the mask matrix:
(8) |
Up to this point, only the top most similar CDPs are retained in the mask. We add a new loss item to transform these CDPs into positive samples, namely :
(9) |
Finally, for each positive pair , the loss in NaP stage is written as below:
(10) |
4. Experiments
In this section, we empirically evaluate the quality of produced node embedding on node classification using two public benchmark datasets: GOOD benchmark and Facebook100.
4.1. Datasets
We use 3 datasets from GOOD benchmark(Gui et al., 2022) and 15 datasets from Facebook100(Traud et al., 2012) for experiments. Datasets from Facebook100 are social networks of 100 universities in the US. Each university is viewed as a domain and each node stands for a student or faculty.
Facebook100 GOOD benchmark Dataset Santa Wake Bucknell Colgate Wesleyan Twitch CBAS Cora Domain university language color degree DGI 87.08% 83.02% 89.24% 89.55% 88.52% 53.34% 52.86% 46.61% GRACE 87.88% 82.70% 90.12% 82.09% 90.80% 58.00% 48.10% 50.85% GCA 89.10% 82.71% 93.01% 91.18% 90.11% 60.14% 50.00% 50.97% COSTA 89.93% 75.29% 91.46% 91.52% 88.36% 49.40% 45.24% 48.09% BGRL 88.80% 83.61% 91.59% 85.18% 82.41% 63.25% 49.05% 40.63% MVGRL 90.12% 78.58% 91.45% 88.38% 90.13% 53.98% 50.95% 47.15% Ours 91.06% 86.55% 93.26% 93.18% 91.51% 61.08% 53.33% 51.31% improve +0.94% +2.94% +0.25% +1.66% +0.71% -2.17% +0.47% +0.34% GCN 92.10% 87.14% 94.47% 93.24% 92.10% 51.65% 65.24% 59.39%
4.2. Experimental Setup
4.2.1. Data settings
We divide the dataset according to GOOD(Gui et al., 2022). Specifically, for the Facebook100, we randomly use 9 domains as the source domains for training, 1 domain (Emory) for validation, and 15 others for testing.
4.2.2. Model and Metric settings
We use 6 contrastive methods: DGI, GRACE, GCA, COSTA, BGRL, MVGRL(Veličković et al., 2018; Zhu et al., 2020, 2021; Thakoor et al., 2021; Zhang et al., 2022b; Hassani and Khasahmadi, 2020) for self-supervised methods, and use GCN(Kipf and Welling, 2016) as supervised baselines. The checkpoint for OOD testing is decided based on the result obtained from OOD validation domains. The reported results represent the average accuracy from three independent runs.
4.2.3. Results and Analysis
(1) NaP surpasses baselines
As shown in the Table.1, NaP outperforms almost all GCL baselines. It is worth noting that NaP surpasses all four baselines - DGI, GRACE, GCA and COSTA(Veličković et al., 2018; Zhu et al., 2020, 2021; Zhang et al., 2022b) - that use InfoNCE loss, with an improvement of up to 11.68%. Furthermore, NaP outperforms BGRL, which uses BYOL(Thakoor et al., 2021) as the loss function, and MVGRL, which uses JSD(Hassani and Khasahmadi, 2020) as the loss function, on the majority of datasets. Last but not least, compared to GCN(Kipf and Welling, 2016), NaP has a relatively good performance considering we use significantly fewer labels.
(2) NaP’s strategy is highly effective.
As shown in Fig.4, NaP achieves higher accuracy on 10 additional domains. Since this experiment utilized GRACE as a warm-up stage, NaP’s superior OOD generalization ability demonstrates the effectiveness of the proposed strategy in this paper.
(3) NaP narrows the distance between domains
As shown in Fig. 5, compared to GRACE, the embedding obtained by NaP exhibits a smaller PDD. More importantly, as the PDD decreases, the node distributions between different domains with the same label become closer.
Input Feature | Embedding | |
---|---|---|
All CDPs | 0.0015 | 0.0199 |
Transformed CDPs | 0.0282 | 0.8523 |
Other CDPs | -0.0010 | -0.0891 |
(4) The CDPs transformed by NaP exhibit semantic similarity in the input space.
As shown in Table.2 the cosine similarity of all transformed CDPs is significantly higher than that of all CDPs and the remaining CDPs. This demonstrates that NaP indeed transforms the most semantically similar CDPs into positive samples.
5. Conclusion
In this work, we investigate the OOD generalization capability of traditional graph contrastive learning methods. We argue that cross-domain pairs (CDPs) make the domains distribution shift larger and hinder the model’s OOD generalization capability. Based on this, we propose to transfer the most semantically similar CDPs as positive samples. Comprehensive experiments show that our method NaP significantly benefits the OOD generalization capability of graph contrastive learning methods.
6. Acknowledgement
This work was supported by the National Natural Science Foundation of China (Grant No.U21B2046, No.62202448), the Strategic Priority Research Program of the CAS under Grants No. XDB0680302.
References
- (1)
- Bi et al. (2023a) Wendong Bi, Xueqi Cheng, Bingbing Xu, Xiaoqian Sun, Li Xu, and Huawei Shen. 2023a. Bridged-gnn: Knowledge bridge learning for effective knowledge transfer. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 99–109.
- Bi et al. (2023b) Wendong Bi, Bingbing Xu, Xiaoqian Sun, Li Xu, Huawei Shen, and Xueqi Cheng. 2023b. Predicting the silent majority on graphs: Knowledge transferable graph neural network. In Proceedings of the ACM Web Conference 2023. 274–285.
- Chen et al. (2023) Guoxin Chen, Yongqing Wang, Fangda Guo, Qinglang Guo, Jiangli Shao, Huawei Shen, and Xueqi Cheng. 2023. Causality and independence enhancement for biased node classification. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 203–212.
- Ding et al. (2021) Mucong Ding, Kezhi Kong, Jiuhai Chen, John Kirchenbauer, Micah Goldblum, David Wipf, Furong Huang, and Tom Goldstein. 2021. A closer look at distribution shifts and out-of-distribution generalization on graphs. (2021).
- Feng et al. (2020) Wenzheng Feng, Jie Zhang, Yuxiao Dong, Yu Han, Huanbo Luan, Qian Xu, Qiang Yang, Evgeny Kharlamov, and Jie Tang. 2020. Graph random neural networks for semi-supervised learning on graphs. Advances in neural information processing systems 33 (2020), 22092–22103.
- Gui et al. (2022) Shurui Gui, Xiner Li, Limei Wang, and Shuiwang Ji. 2022. Good: A graph out-of-distribution benchmark. Advances in Neural Information Processing Systems 35 (2022), 2059–2073.
- Hafidi et al. ([n. d.]) H Hafidi, M Ghogho, P Ciblat, and A Swami. [n. d.]. Graphcl: Contrastive self-supervised learning of graph representations. arXiv 2020. arXiv preprint arXiv:2007.08025 ([n. d.]).
- Hassani and Khasahmadi (2020) Kaveh Hassani and Amir Hosein Khasahmadi. 2020. Contrastive multi-view representation learning on graphs. In International conference on machine learning. PMLR, 4116–4126.
- Hou et al. (2022) Zhenyu Hou, Xiao Liu, Yukuo Cen, Yuxiao Dong, Hongxia Yang, Chunjie Wang, and Jie Tang. 2022. Graphmae: Self-supervised masked graph autoencoders. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 594–604.
- Hu et al. (2020) Shoubo Hu, Kun Zhang, Zhitang Chen, and Laiwan Chan. 2020. Domain generalization via multidomain discriminant analysis. In Uncertainty in Artificial Intelligence. PMLR, 292–302.
- Jin et al. (2021) Ming Jin, Yizhen Zheng, Yuan-Fang Li, Chen Gong, Chuan Zhou, and Shirui Pan. 2021. Multi-scale contrastive siamese networks for self-supervised graph representation learning. arXiv preprint arXiv:2105.05682 (2021).
- Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
- Li et al. (2022) Haoyang Li, Ziwei Zhang, Xin Wang, and Wenwu Zhu. 2022. Learning invariant graph representations for out-of-distribution generalization. Advances in Neural Information Processing Systems 35 (2022), 11828–11841.
- Li et al. (2018) Ya Li, Mingming Gong, Xinmei Tian, Tongliang Liu, and Dacheng Tao. 2018. Domain generalization via conditional invariant representations. In Proceedings of the AAAI conference on artificial intelligence, Vol. 32.
- Mavromatis and Karypis (2021) Costas Mavromatis and George Karypis. 2021. Graph infoclust: Maximizing coarse-grain mutual information in graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 541–553.
- Miao et al. (2022) Siqi Miao, Mia Liu, and Pan Li. 2022. Interpretable and generalizable graph learning via stochastic attention mechanism. In International Conference on Machine Learning. PMLR, 15524–15543.
- Muandet et al. (2013) Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. 2013. Domain generalization via invariant feature representation. In International conference on machine learning. PMLR, 10–18.
- Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).
- Qiu et al. (2020) Jiezhong Qiu, Qibin Chen, Yuxiao Dong, Jing Zhang, Hongxia Yang, Ming Ding, Kuansan Wang, and Jie Tang. 2020. Gcc: Graph contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. 1150–1160.
- Thakoor et al. (2021) Shantanu Thakoor, Corentin Tallec, Mohammad Gheshlaghi Azar, Mehdi Azabou, Eva L Dyer, Remi Munos, Petar Veličković, and Michal Valko. 2021. Large-scale representation learning on graphs via bootstrapping. arXiv preprint arXiv:2102.06514 (2021).
- Tong et al. (2023) Peifeng Tong, Wu Su, He Li, Jialin Ding, Zhan Haoxiang, and Song Xi Chen. 2023. Distribution free domain generalization. In International Conference on Machine Learning. PMLR, 34369–34378.
- Traud et al. (2012) Amanda L Traud, Peter J Mucha, and Mason A Porter. 2012. Social structure of facebook networks. Physica A: Statistical Mechanics and its Applications 391, 16 (2012), 4165–4180.
- Veličković et al. (2017) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
- Veličković et al. (2018) Petar Veličković, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. 2018. Deep graph infomax. arXiv preprint arXiv:1809.10341 (2018).
- Verma et al. (2021) Vikas Verma, Thang Luong, Kenji Kawaguchi, Hieu Pham, and Quoc Le. 2021. Towards domain-agnostic contrastive learning. In International Conference on Machine Learning. PMLR, 10530–10541.
- Wang et al. (2021) Xiao Wang, Nian Liu, Hui Han, and Chuan Shi. 2021. Self-supervised heterogeneous graph neural network with co-contrastive learning. In Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining. 1726–1736.
- Wu et al. (2019) Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. 2019. Simplifying graph convolutional networks. In International conference on machine learning. PMLR, 6861–6871.
- Wu et al. (2022b) Qitian Wu, Hengrui Zhang, Junchi Yan, and David Wipf. 2022b. Handling distribution shifts on graphs: An invariance perspective. arXiv preprint arXiv:2202.02466 (2022).
- Wu et al. (2022a) Ying-Xin Wu, Xiang Wang, An Zhang, Xiangnan He, and Tat-Seng Chua. 2022a. Discovering invariant rationales for graph neural networks. arXiv preprint arXiv:2201.12872 (2022).
- Xu et al. (2018) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2018. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826 (2018).
- You et al. (2021) Yuning You, Tianlong Chen, Yang Shen, and Zhangyang Wang. 2021. Graph contrastive learning automated. In International Conference on Machine Learning. PMLR, 12121–12132.
- Zeng and Xie (2021) Jiaqi Zeng and Pengtao Xie. 2021. Contrastive self-supervised learning for graph classification. In Proceedings of the AAAI conference on Artificial Intelligence, Vol. 35. 10824–10832.
- Zhang et al. (2021) Shengyu Zhang, Kun Kuang, Jiezhong Qiu, Jin Yu, Zhou Zhao, Hongxia Yang, Zhongfei Zhang, and Fei Wu. 2021. Stable prediction on graphs with agnostic distribution shift. arXiv preprint arXiv:2110.03865 (2021).
- Zhang et al. (2022b) Yifei Zhang, Hao Zhu, Zixing Song, Piotr Koniusz, and Irwin King. 2022b. COSTA: covariance-preserving feature augmentation for graph contrastive learning. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2524–2534.
- Zhang et al. (2022a) Zeyang Zhang, Xin Wang, Ziwei Zhang, Haoyang Li, Zhou Qin, and Wenwu Zhu. 2022a. Dynamic graph neural networks under spatio-temporal distribution shift. Advances in Neural Information Processing Systems 35 (2022), 6074–6089.
- Zhu et al. (2020) Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. 2020. Deep graph contrastive representation learning. arXiv preprint arXiv:2006.04131 (2020).
- Zhu et al. (2021) Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. 2021. Graph contrastive learning with adaptive augmentation. In Proceedings of the Web Conference 2021. 2069–2080.