\settopmatter

printacmref=false

\orcid

0009-0005-9229-5188

Noisy Node Classification
by Bi-level Optimization based Multi-teacher Distillation

Yujing Liu \institutionGuangxi Key Lab of Multisource Information Mining and Security, Guangxi Normal University \cityGuilin \countryChina \postcode541004 liuyujingcn@gmail.com , Zongqian Wu \institutionSchool of Computer Science and Engineering, University of Electronic Science and Technology of China \cityChengdu \countryChina \postcode611731 wkzongqianwu@gmail.com , Zhengyu Lu \institutionGuangxi Key Lab of Multisource Information Mining and Security, Guangxi Normal University \cityGuilin \countryChina \postcode541004 airmanlu@foxmail.com , Ci Nie \institutionGuangxi Key Lab of Multisource Information Mining and Security, Guangxi Normal University \cityGuilin \countryChina \postcode541004 nieci2024@gmail.com , Guoqiu Wen \institutionGuangxi Key Lab of Multisource Information Mining and Security, Guangxi Normal University \cityGuilin \countryChina \postcode541004 wenguoqiu2008@163.com , Ping Hu \institutionSchool of Computer Science and Engineering, University of Electronic Science and Technology of China \cityChengdu \countryChina \postcode610000 chinahuping@gmail.com and Xiaofeng Zhu \institutionGuangxi Key Lab of Multisource Information Mining and Security, Guangxi Normal University \cityGuilin \countryChina \postcode541004 seanzhuxf@gmail.com

Abstract.

Previous graph neural networks (GNNs) usually assume that the graph data is with clean labels for representation learning, but it is not true in real applications. In this paper, we propose a new multi-teacher distillation method based on bi-level optimization (namely BO-NNC), to conduct noisy node classification on the graph data. Specifically, we first employ multiple self-supervised learning methods to train diverse teacher models, and then aggregate their predictions through a teacher weight matrix. Furthermore, we design a new bi-level optimization strategy to dynamically adjust the teacher weight matrix based on the training progress of the student model. Finally, we design a label improvement module to improve the label quality. Extensive experimental results on real datasets show that our method achieves the best results compared to state-of-the-art methods. The source code is listed in Supplementary Materials.

Key words and phrases:

Graph convolutional networks, Noisy labels, Bi-level Optimization, Semi-supervision node Classification

{CCSXML}

<ccs2012> <concept> <concept_id>00000000.0000000.0000000</concept_id> <concept_desc>Computers systems organization, Embedded systems</concept_desc> <concept_significance>500</concept_significance> </concept> </ccs2012>

\ccsdesc

[500]Computing methodologies Neural networks

1. Introduction

Node classification usually relies on correct supervised information to predict unlabelled nodes on the graph data. However, real-world graph data often contain incorrect supervised information, i.e., noisy labels, which easily misleads the training model and in turn degrades the node classification performance. To address this issue, noisy node classification (NNC) has been widely proposed to train machine learning models (e.g., graph neural networks (GNNs)) on the graph data with noisy labels¹¹1Related Work is listed in Appendix A. Li et al. (2021), so NNC has garnered growing interest in practical applications Dai and Wang (2021); Guo et al. (2019).

A number of methods have been proposed to handle noisy labels. For example, NRGNN Dai et al. (2021) trains a pseudo-label miner to select pseudo-labels for augmenting the supervision information. UnionNET Li et al. (2021) first trains a GNN using the original labels and then performs label correction based on the node embeddings learned by the obtained GNN. However, these methods excessively depend on the performance of individual model, and their effectiveness can significantly decrease as the noise rate increases. For instance, in the case of high noise rates, NRGNN will select incorrect pseudo-labels and UnionNET will contaminate correct labels with label correction. One of the key reasons is that previous methods design a single model only to handle with noisy label to result in limited robustness.

To alleviate the above issue, the multi-teacher distillation method is proposed to transfer knowledge from multiple models to deal with noisy labels. For instance, MTS-GNN Liu et al. (2023a) uses models saved from earlier iterations to guide model training and label correction for later iterations. However, previous multi-teacher distillation methods still have limitations to be addressed. First, the constructed multi-teacher models across nearby iterations have few difference, and thus resulting in the lack of diversity. Second, previous methods do not consider the complementary information among teacher models and could not fully exploit the knowledge of teacher models.

Refer to caption — Figure 1. The framework of the proposed BO-NNC, consisting of three modules, i.e., Multi-teacher Construction, Multi-teacher Distillation, and Label Improvement. Specifically, Multi-teacher Construction employs multiple self-supervised learning methods to obtain diverse teacher models. Multi-teacher Distillation transfers the knowledge from the teacher models to the student model through multi-teacher distillation based on bi-level optimization. Specifically, the lower level makes the student model learn the knowledge of the teacher models from the soft label matrix, which is the Hadamard product between the teacher weight matrix and $k$ prediction probability matrices produced by teacher models. The upper level updates the teacher weight matrix based on the training progress of the student model. Label Improvement uses both the student model and the teacher models to first detect noisy labels and then select pseudo-labels.

Addressing the above issues of multi-teacher distillation methods is challenging. First, since the labelled data is sparse and with noisy, it is difficult to train diverse teacher models by existing methods (e.g., randomly sampling training subsets from the training set). Second, previous methods usually require a large number of correct labels to learn the weights of the teacher models, which is not practical for NNC tasks. For example, MTS-GNN sets the weights of teacher models as hyper-parameters, resulting in expensive time cost and difficult to reasonably set the weights.

In this paper, we propose a new multi-teacher distillation method based on bi-level optimization (namely BO-NNC, shown in Figure 1) to address the above issues. To do this, we first employ diverse self-supervised graph methods (e.g., Veličković et al. (2018); Zhu et al. (2021); Mo et al. (2022)) to construct multiple teacher models, solving the first issue of previous methods, and then construct a teacher weight matrix to integrate the predictions of the teacher models as the soft label matrix, which is regarded as the ground truth of the student model. Moreover, we propose a new bi-level optimization strategy to iteratively train the teacher weight matrix and the student model. This allows the teacher weight matrix to be dynamically adjusted based on the training progress of the student model without requiring a large number of correct labels to learn the weights of the teacher models. In particular, the proposed bi-level optimization algorithm helps to learn the complementary information among teacher models. After the bi-level optimization, we further design a label improvement module to improve the label quality of the dataset, which further enhances the learning of multiple teacher models and the student model.

Compared with previous NNC methods, the main contributions of our proposed method are summarized as follows:

$\bullet$

We propose a new method to deal with noisy node classification. Specifically, our proposed method solves the limitations of previous methods by multi-teacher distillations, and designs a label improvement module to gradually improve the quality of labels.
$\bullet$

We design a new distillation method based on bi-level optimization, which automatically adjust the teacher weight matrix, thus more fully exploring the complementary information among multiple teacher models.
$\bullet$

We conduct extensive experiments on five real datasets. Experimental results demonstrate that our proposed method achieves the best performance, compared to existing SOTA methods at different levels of noisy data.

2. Methodology

Denoting $\mathcal{G}=(\mathcal{V},\mathcal{E})$ as a graph, where $\mathcal{V}=\left\{v_{1},\cdots,v_{n}\right\}$ represents the node set with $n$ nodes and $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ represents the edge set, we denote $\mathbf{X}=\left\{\mathbf{x}_{1},\cdots,\mathbf{x}_{n}\right\}\in\mathbb{R}^{n% \times d}$ and $\mathbf{A}\in\mathbb{R}^{n\times n}$ , respectively, as the feature matrix of all nodes and the adjacency matrix of the graph $\mathcal{G}$ , where $\mathbf{x}_{i}\in\mathbb{R}^{d}$ is the $d$ -dimensional feature vector of the $i$ -th node. The labels in the datasets are denoted as $\mathbf{Y}=\{\mathbf{y}_{1},\cdots,\mathbf{y}_{b}\}$ , where $b$ is the number of labelled nodes, $\mathbf{y}_{i}\in\mathbb{R}^{c}$ is one-hot encoding of the $i$ -th label and $c$ is the number of classes.

2.1. Multi-teacher Construction

Existing studies Wu et al. (2022) demonstrate that different representation learning methods usually output diverse node representations. Motivated by such observation, we employ existing graph representation learning algorithms to construct multiple teacher models.

Specifically, we first train $k$ encoders by $k$ different unsupervised graph representation learning methods and then use them to obtain node embeddings:

\displaystyle\mathbf{H}_{i}=\mathbf{F}_{i}(\mathbf{X},\mathbf{A}),

(1)

where $i\in[1,...,k]$ and $\mathbf{F}_{i}$ denotes the $i$ -th encoder. We then use individual embedding as inputs to train $k$ classifiers by the loss function as follows:

\displaystyle\mathcal{L}_{i}=-\sum_{j\in\mathcal{S}}\mathbf{y}_{j}\log\mathbf{% p}_{j}^{i},

(2)

where $\mathcal{S}$ denotes the labelled training set, $\mathbf{p}_{j}^{i}\in\mathbb{R}^{c}$ denotes the prediction of the $i$ -th classifier for node $v_{j}$ , and $\mathbf{y}_{j}$ denotes the label of node $v_{j}$ . In the end, we obtain $k$ teacher models consisting of $k$ encoders and $k$ classifiers to achieve model diversity.

The literature (e.g., Hinton et al. (2015); Kim et al. (2021)) demonstrate that the knowledge in the teacher model can be transferred to the student model by making the output of the student model mimic the output of the teacher model. Motivated by this, we consider using the predictions of the teacher models as soft labels to train the student model. This may enable the student model to explore the complementary information among teacher models, and thus obtaining better generalization performance. In this way, our method may successfully address the issue of the excessive dependence on the performance of a single model in previous methods.

2.2. Multi-teacher Distillation

After constructing multiple teacher models on labelled nodes, in this section, diverse knowledge will be transferred to the student model. To do this, we first employ multiple teachers to generate the node embeddings $\mathbf{H}=\{\mathbf{H}_{1},...,\mathbf{H}_{k}\}$ and prediction probability matrices $\mathbf{P}=\{\mathbf{P}_{1},...,\mathbf{P}_{k}\}$ for all nodes (including labelled nodes and unlabelled nodes), which are then used to guide the student model.

Second, the student model are generated by randomly initialized parameters on all nodes. That is, the node embedding of all nodes $\mathbf{Z}^{(i)}$ in the $i$ -th layer of the student model and the probability matrix $\mathbf{P}^{S}$ of the student model are obtained by:

\begin{cases}{}\mathbf{Z}^{(i)}=\sigma(\mathbf{\hat{A}}\mathbf{Z}^{(i-1)}% \mathbf{W}^{(i-1)}),\\ \mathbf{P}^{S}=\operatorname{softmax}(\sigma(\mathbf{\hat{A}}\mathbf{Z}^{(i)}% \mathbf{W}^{(i)})),\end{cases}

(3)

where $\mathbf{W}^{(i)}$ is the trainable parameters of the $i$ -th layer and $\sigma$ is the activation function.

After that, we fuse $k$ prediction probability matrices to a probability matrix, i.e., soft label matrix, which is the ground truth of the student model. The simple strategy for mapping $k$ prediction probability matrices to the soft label matrix is the mean method. However, such a method equivalently treats every teacher model. A good alternative is co-attention method Gao et al. (2021). Specifically, we learn the weight matrix $\mathcal{W}\in\mathbb{R}^{k\times c}$ ( $c$ denotes the number of classes) to map $k$ prediction probability matrices to the soft label matrix. However, previous co-attention methods usually require a large number of correct labels to learn the weights of the teacher models, which is not feasible in NNC tasks.

To address the above issue, we investigate clean nodes to effectively learn the weight matrix, resulting in a bi-level optimization. Specifically, after $t$ iterations of the student model, we calculate the loss among clean nodes to update the teacher weight matrix by the back-propagation (Details in Section 3). As a result, information obtained from the student model can be used to improve the teacher weight matrix. The updated teacher weight matrix guarantees the effectiveness of soft labels even with low-quality teacher models due to their limited number of labels.

In our method, clean nodes come from unlabelled nodes and labelled nodes. We select clean nodes from unlabelled nodes by first considering the prediction results of all teacher models and then filtering the selected clean nodes by node embedding similarity. Specifically, we first chose unlabelled nodes with consistent prediction among all teacher models to form the candidate set $Can$ :

\displaystyle Can=\{v_{i}\mid\forall j\in(1,k-1),\,\arg\max\mathbf{p}_{i}^{j}=% \arg\max\mathbf{p}_{i}^{k}\},

(4)

where $\mathbf{p}_{i}^{j}$ denotes the prediction of the $j$ -th teacher model for node $v_{i}$ . We then filter clean nodes in $Can$ based on the node embedding similarity between every node in $Can$ and its $\beta_{1}$ ( $\beta_{1}$ is a hyper-parameter) closest nodes in the same class, i.e.,

\displaystyle\xi_{i}=\sum_{j=1}^{k}\sum_{v_{u}\in C_{i}}d(h^{j}_{i},h^{j}_{u}),

(5)

where $C_{i}$ denotes the set consisting of $\beta_{1}$ closest nodes of the same class to node $v_{i}$ , $h^{j}_{i}$ denotes the embedding vector of the $j$ -th teacher for node $v_{i}$ , and $d(\cdot,\cdot)$ denotes Euclidean distance. Finally, we select $\beta_{2}$ ( $\beta_{2}$ is a hyper-parameter) nodes with the smallest $\xi$ (i.e., with the highest confidence) in each class as the first part of clean nodes.

In order to increase the number of clean nodes, we consider selecting the second part of clean nodes from labelled nodes, based on the loss values of the teacher models. Specifically, we first use the teacher models to calculate loss values for all labelled nodes:

\displaystyle{l}_{i}^{T}=-\sum_{j=1}^{k}\mathbf{y}_{i}\log\mathbf{p}_{i}^{j},

(6)

where ${l}_{i}^{T}$ denotes the loss value of node $v_{i}$ . After that, we select $\alpha\%$ ( $\alpha$ is a hyper-parameter) nodes with the lowest loss values (i.e., with the highest confidence) in each class to form the second part of clean nodes.

Finally, we combine two parts of clean nodes (i.e., with the highest confidence out of all nodes) to obtain the clean node set $Cle$ , which is used to calculate the loss of the upper level, in the bi-level optimization. Moreover, the loss in the upper level is then used to update the teacher weight matrix. In the lower level, the teacher weight matrix is used to generate the soft label matrix, which is further used to update the parameters of the student model. In particular, the bi-level optimization strategy makes the teacher weight matrix be dynamically adjusted based on the feedback from the student model. This allows the updated teacher weight matrix to effectively fuse multiple prediction matrices, i.e., extracting their common information across all models and complementary information in individual models.

2.3. Label Improvement

After conducting both multi-teacher construction and multi-teacher distillation, we design a label improvement module to gradually improve the label quality of the dataset during the training process. This module involves two phases, i.e., noisy label filtering and pseudo-label selection.

2.3.1. Noisy label Filtering

Noisy label filtering is designed to filter a portion of noisy labels from the labelled data, thereby reducing incorrect supervision information in the dataset. Existing literature Gui et al. (2021) shows that the node with higher loss has higher probability to be a noisy node than the node with lower loss. Moreover, the loss in one class is different from that in other classes. Hence, we calculate the loss in the $j$ -th class between the node labels and the predictions of the student model by:

\displaystyle L_{j}=\{{l}_{i}^{S}\mid\mathbf{y}_{i}=j\},\leavevmode\nobreak\ % \text{where}\leavevmode\nobreak\ {l}_{i}^{S}=-\mathbf{y}_{i}\log\mathbf{p}_{i}% ^{S},

(7)

where $j\in(1,c)$ denotes the $j$ -th class. We then set a threshold $\bm{\varphi}$ for each class to select noisy nodes:

\displaystyle\bm{\varphi}_{i}=sort(L_{i})_{\lceil{\delta}_{i}r\rceil},

(8)

where ${\delta}_{i}$ denotes the number of labelled nodes of class $i$ , $r$ is a hyper-parameter that denotes the percentage of selected noisy nodes in the total labelled nodes. Finally, nodes with loss values larger than threshold in each class are regarded as noisy nodes and they are regarded as unlabelled nodes in the following part of the label improvement module.

2.3.2. Pseudo-label Selection

After removing noisy labels, pseudo-label selection investigates to assign pseudo-labels to unlabelled nodes, aim at increasing supervision information. We follow the literature Li et al. (2018) to separately select pseudo-labels based on the prediction confidence of both the student model and the teacher models. Specifically, we first use the student model to predict all unlabelled nodes and then record the confidence of the student model by:

\delta^{S}_{i}=\max\mathbf{p}_{i}^{S},

(9)

where $\delta^{S}_{i}$ denotes the prediction confidence of the student model for node $v_{i}$ . Given the prediction confidence, we select the top $\rho$ ( $\rho$ is a hyper-parameter) nodes with the highest $\delta^{S}$ in each class as the first part of pseudo-labelled nodes.

We then sum the predictions of the teacher models as teacher predictions and calculate the confidence of the teacher models by:

\delta^{T}_{i}=\max(\sum_{j=1}^{k}\mathbf{p}^{j}_{i}),

(10)

where $\delta^{T}_{i}$ denotes the prediction confidence of the teacher models for node $v_{i}$ . Similarly, we select the top $\rho$ nodes with the highest $\delta^{T}$ in each class as the second part of pseudo-labelled nodes.

Finally, we assign all pseudo-labelled nodes with pseudo-labels and further add them to the labelled set, which will be used to adaptively update both the multi-teacher construction module and the multi-teacher distillation module, whose robustness are improved on the data with corrected labels and pseudo-labels with high quality.

3. Bi-level Optimization

3.1. Optimization Process

Previous works often separately the teacher weight matrix and the student model in Section 2.2. This ignores the potentiality of the student model to provide information for improving the teacher weight matrix. In this paper, we propose to dynamically adjust the teacher weight matrix based on the training progress of the student model, aiming at reducing the dependence of the learning the teacher weight matrix on correct labels. As a result, the optimization of both the teacher weight matrix and the student model results in a bi-level optimization problem, Liu et al. (2021a); Chen et al. (2022). Considering that the bi-level optimization usually includes two levels of optimization tasks, where the solution of the upper optimization task depends on the results of the lower optimization task. To address the above issue, in this paper, we treat the optimization of the teacher weight matrix as the upper level task and the training of the student model as the lower level task, to have the following objective function:

\displaystyle\underbrace{\underset{\theta_{w}}{\arg\min}\ \mathbf{\Gamma}_{% \scriptscriptstyle\text{up}}(\theta_{w},\theta_{s}^{t})}_{\text{upper level % task}}\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ s.t.% \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \theta_{s}^{t}=% \underbrace{\underset{\theta_{s}}{\arg\min}\ \mathbf{\Gamma}_{% \scriptscriptstyle\text{low}}(\theta_{w},\theta_{s})}_{\text{lower level task}},

(11)

where $\mathbf{\Gamma}_{\scriptscriptstyle\text{up}}$ and $\mathbf{\Gamma}_{\scriptscriptstyle\text{low}}$ , respectively, denote the loss functions for the teacher weight matrix and the student model. $\theta_{w}$ and $\theta_{s}$ , respectively, denote the trainable parameters of the teacher weight matrix $\mathcal{W}$ and the student model. $\theta_{s}^{t}$ denotes the parameters of the student model in the $t$ -th iteration.

In the lower-level optimization, we train the student model using soft labels. Specifically, we first construct a teacher weight matrix $\mathcal{W}\in\mathbb{R}^{k\times c}$ to fuse the predictions of $k$ teacher models as a soft label matrix, i.e.,

\displaystyle\widetilde{\mathbf{y}}_{i}=\operatorname{softmax}(\sum_{j=1}^{k}% \mathcal{W}_{j}\odot\mathbf{p}_{i}^{j}),

(12)

where “ $\odot$ " denotes the Hadamard product. We then calculate the loss of the lower level optimization by:

\displaystyle\mathbf{\Gamma}_{\scriptscriptstyle\text{low}}(\theta_{w},\theta_% {s})=\mathcal{L}_{s},\leavevmode\nobreak\ \text{where}\leavevmode\nobreak\ % \mathcal{L}_{s}=-\sum_{i\in\mathcal{T}}\widetilde{\mathbf{y}}_{i}\log\mathbf{p% }_{i}^{S},

(13)

where $\mathcal{T}$ denotes the training set, and $\mathbf{p}_{i}^{S}$ denote the prediction of the student model for node $v_{i}$ .

After calculating $\mathbf{\Gamma}_{\scriptscriptstyle\text{low}}$ , we update the parameters $\theta_{s}$ of the student model by the gradient descent:

\theta_{s}^{\mu}=\theta_{s}^{\mu-1}-\eta_{\mu}\scalebox{1.3}{d}_{{\Gamma}_{% \scriptscriptstyle\text{low}}}(\theta_{s}^{\mu-1};\theta_{w}),

(14)

where $\eta_{\mu}$ denotes the learning rate and $\scalebox{1.3}{d}_{{\Gamma}_{\scriptscriptstyle\text{low}}}(\theta_{s}^{\mu-1}% ;\theta_{w})$ denotes the derivative of $\mathcal{L}_{s}$ with respect to $\theta_{s}^{\mu-1}$ . As a result, in the lower level optimization, we transfer the knowledge from the teacher models to the student model by having the prediction of the student model mimic the soft labels.

After the lower level optimization, we first pass the result $\theta_{s}^{t}$ of the lower level optimization to the upper level optimization, and then conduct the upper-level optimization. To do this, we update the teacher weight matrix using both the student model (after $t$ iterations) and clean nodes. Specifically, the loss function of the upper level optimization is designed as the cross-entropy loss between the prediction of the clean nodes in the student model and the pseudo-labels of clean nodes:

\displaystyle\mathbf{\Gamma}_{\scriptscriptstyle\text{up}}(\theta_{w},\theta_{% s}^{t})=\mathcal{L}_{w},\leavevmode\nobreak\ \text{where}\leavevmode\nobreak\ % \mathcal{L}_{w}=-\sum_{i\in Cle}\hat{\mathbf{y}}_{i}\log\mathbf{p}^{s_{t}}_{i},

(15)

where $Cle$ denotes the clean node set, $\hat{\mathbf{y}}_{i}$ denotes the pseudo-label of $v_{i}$ , and $\mathbf{p}^{s_{t}}_{i}$ denotes the prediction of the student model (after $t$ iterations) for node $v_{i}$ .

Obviously, $\theta_{w}$ is not directly involved in the calculation of $\mathcal{L}_{w}$ , thus we cannot follow the ordinary approach to directly calculate the gradient of $\mathcal{L}_{w}$ with respect to $\theta_{w}$ and update $\theta_{w}$ . To address this issue, we follow the literature Liu et al. (2021a, b) to perform derivations based on specific situation.

Specifically, we denote the outer optimization objective as $\underset{\theta_{w}}{\min}\ \bm{\varphi}(\theta_{w}):=\mathbf{\Gamma}_{% \scriptscriptstyle\text{up}}(\theta_{w},\theta_{s}^{t})$ , where $\theta_{s}^{t}$ is a variable dependent on $\theta_{w}$ , so we have:

\displaystyle\frac{\partial{\bm{\varphi}(\theta_{w})}}{\partial{\theta_{w}}}=% \underbrace{\frac{\partial{\mathbf{\Gamma}_{\scriptscriptstyle\text{up}}(% \theta_{w},\theta_{s}^{t})}}{\partial{\theta_{w}}}}_{\text{direct grad}}+% \underbrace{\frac{\partial{\mathbf{\Gamma}_{\scriptscriptstyle\text{up}}(% \theta_{w},\theta_{s}^{t})}}{\partial{\theta_{s}^{t}}}(\frac{\partial{\theta_{% s}^{t}}}{\partial{\theta_{w}}})}_{\text{indirect grad}}.

(16)

For the Eq.16, on the one hand, $\theta_{w}$ is not directly involved in the calculation of $\mathcal{L}_{w}$ , thus the “direct grad" is zero. On the other hand, the “indirect grad" consists of two parts, i.e., $\frac{\partial{\mathbf{\Gamma}_{\scriptscriptstyle\text{up}}(\theta_{w},\theta% _{s}^{t})}}{\partial{\theta_{s}^{t}}}$ and $\frac{\partial{\theta_{s}^{t}}}{\partial{\theta_{w}}}$ . Since the former can be directly calculated by the derivation of $\mathcal{L}_{w}$ with respect to $\theta_{s}^{t}$ , the main challenging lies in the calculation both $\frac{\partial{\theta_{s}^{t}}}{\partial{\theta_{w}}}$ and $\frac{\partial{\bm{\varphi}(\theta_{w})}}{\partial{\theta_{w}}}$ .

In this paper, we first dynamically unfold $\theta_{s}^{t}$ based on Eq.14 to calculate $\frac{\partial{\theta_{s}^{t}}}{\partial{\theta_{w}}}$ by:

	$\displaystyle\theta_{s}^{t}=\bm{\psi}_{t}(\theta_{s}^{t-1};\theta_{w}),% \leavevmode\nobreak\ \leavevmode\nobreak\ \text{where}\leavevmode\nobreak\ % \leavevmode\nobreak\$		(17)
	$\displaystyle\bm{\psi}_{i}(\theta_{s}^{i-1};\theta_{w})=\theta_{s}^{i-1}-\eta_% {i}\scalebox{1.3}{d}_{\mathbf{\Gamma}_{\scriptscriptstyle\text{low}}}(\theta_{% s}^{i-1};\theta_{w}),\leavevmode\nobreak\ i=1,\cdots,t.$

Combining $\frac{\partial{\theta_{s}^{t}}}{\partial{\theta_{w}}}$ and Eq.17, we have:

\displaystyle\frac{\partial{\theta_{s}^{t}}}{\partial{\theta_{w}}}=\frac{% \partial{\bm{\psi}_{t}(\theta_{s}^{t-1};\theta_{w})}}{\partial{\theta_{s}^{t-1% }}}\frac{\partial{\theta_{s}^{t-1}}}{\partial{\theta_{w}}}+\frac{\partial{\bm{% \psi}_{t}(\theta_{s}^{t-1};\theta_{w})}}{\partial{\theta_{w}}}.

(18)

To simplify the notation, we denote:

\mathbf{Z}_{t}=\frac{\partial{\theta_{s}^{t}}}{\partial{\theta_{w}}},\ \mathbf% {A}_{t}=\frac{\partial{\bm{\psi}_{t}(\theta_{s}^{t-1};\theta_{w})}}{\partial{% \theta_{s}^{t-1}}},\ \mathbf{B}_{t}=\frac{\partial{\bm{\psi}_{t}(\theta_{s}^{t% -1};\theta_{w})}}{\partial{\theta_{w}}}.

(19)

We then rewrite Eq.18 to:

\displaystyle\frac{\partial{\theta_{s}^{t}}}{\partial{\theta_{w}}}=\mathbf{Z}_% {t}=\sum\limits_{i=0}^{t}{(\prod\limits_{j=0}^{t}\mathbf{A}_{j})\mathbf{B}_{i}},

(20)

where $\mathbf{A}_{0}=\mathbf{B}_{0}=0$ . $\mathbf{A}_{i}$ and $\mathbf{B}_{i}$ ( $i\neq 0$ ), respectively, can be obtained by calculating the derivative of $\scalebox{1.3}{d}_{\mathbf{\Gamma}_{\scriptscriptstyle\text{low}}}(\theta_{s}^% {i-1};\theta_{w})$ with respect to $\theta_{s}^{i-1}$ and $\theta_{w}$ .

Based on the above derivation, it is obviously that we can iteratively calculate $\frac{\partial{\theta_{s}^{t}}}{\partial{\theta_{w}}}$ during the optimization of the lower level optimization. Finally, we calculate $\frac{\partial{\bm{\varphi}(\theta_{w})}}{\partial{\theta_{w}}}=\frac{\partial% {\mathbf{\Gamma}_{\scriptscriptstyle\text{up}}(\theta_{w},\theta_{s}^{t})}}{% \partial{\theta_{s}^{t}}}(\frac{\partial{\theta_{s}^{t}}}{\partial{\theta_{w}}})$ and update the parameters $\theta_{w}$ of the teacher weight matrix by the gradient descent method:

\theta_{w}^{\lambda}=\theta_{w}^{\lambda-1}-\eta_{\lambda}\scalebox{1.3}{d}_{% \bm{\varphi}(\theta_{w})}(\theta_{s}^{t};\theta_{w}^{\lambda-1}),

(21)

where $\eta_{\lambda}$ denotes the learning rate and $\scalebox{1.3}{d}_{\bm{\varphi}(\theta_{w})}(\theta_{s}^{t};\theta_{w}^{% \lambda-1})$ denotes the derivative of $\mathcal{L}_{w}$ with respect to $\theta_{w}^{\lambda-1}$ .

In the upper level optimization, we use the feedback from the lower level optimization (i.e., the student model) to update the teacher weight matrix, and thus rationally adjust the weights of the teacher models. After updating the teacher weight matrix, we map the predictions of $k$ teacher models to soft labels through Eq.12 and then train the student models again (i.e., the lower level optimization). By iteratively updating the upper level optimization and lower level optimization, we fully transfer knowledge from the teacher models to the student model.

Finally, through bi-level optimization strategy, we successfully addressed the issue that previous methods require a large number of correct labels to learn the weights of the teacher models (Validated in Section 4.5).

3.2. Complexity Analysis

3.2.1. Time Complexity

The bi-level optimization needs to first construct upper and lower level optimization tasks and and then iteratively optimize these two tasks. Specifically, in the forward propagation, our method needs the time cost of O( $n^{2}d$ ), where $n$ and $d$ , respectively, are the number of nodes and dimensions. In the back propagation, our bi-level optimization requires to calculate the gradient $\frac{\partial{\theta_{s}^{t}}}{\partial{\theta_{w}}}$ and its complexity is O( $s+w$ ) Franceschi et al. (2017), where $w$ is the parameter number of the teacher weight matrix, usually $s>>w$ . As a result, the time complexity of our bi-level optimization method is O( $n^{2}d+s+w$ ). In contrast, traditional optimization methods need O( $n^{2}d+s$ ). Obviously, they are quadratic to the node number.

3.2.2. Space Complexity

The bi-level optimization strategy calculates the gradients of the loss for the parameters of the student model to have the space complexity of O( $(n+d+c)n+s$ ), where the space complexity of the feature matrix, the adjacency matrix, the prediction matrix, the model parameter matrix, and the parameter gradient matrix, respectively, are with O( $nd$ ), O( $n^{2}$ ), O( $nc$ ), O( $s$ ), and O( $s$ ), where $c$ denotes the output dimension of the model. The bi-level optimization strategy need to store $\frac{\partial{\theta_{s}^{t}}}{\partial{\theta_{w}}}$ , $\frac{\partial{\bm{\psi}_{t}(\theta_{s}^{t-1};\theta_{w})}}{\partial{\theta_{s% }^{t-1}}}$ and $\frac{\partial{\bm{\psi}_{t}(\theta_{s}^{t-1};\theta_{w})}}{\partial{\theta_{w% }}}$ in the lower level optimization task as well. Hence, the space complexity in our bi-level optimization strategy is O( $(n+d+c)n+s+w$ ), which is quadratic to the node number. In contrast, the space complexity of the traditional optimization strategy (e.g., the first-order gradient descent methods) is of O( $(n+d+c)n+s$ ).

4. Experiments

Table 1. Classification accuracy (average accuracy (%) and standard deviation) of all methods with different noise rates on all datasets. Note that, the bold number represents the best results in the whole column.

Datasets	Methods	Uniform			Pair
Datasets	Methods	20%	40%	60%	20%	40%
Cora	GCN	74.74 (0.6)	69.40 (1.6)	41.76 (1.9)	72.58 (2.1)	51.82 (0.4)
	DGI	81.56 (0.2)	79.92 (0.4)	48.94 (0.3)	81.16 (0.3)	61.60 (0.5)
	GREET	80.88 (0.8)	78.22 (1.3)	44.70 (2.8)	77.96 (1.7)	56.42 (1.8)
	JoCoR	75.58 (0.8)	71.14 (1.1)	42.82 (1.7)	74.46 (1.1)	52.54 (1.2)
	NRGNN	79.14 (0.7)	78.68 (1.0)	52.02 (2.1)	78.66 (1.8)	56.30 (5.9)
	MTS-GNN	81.50 (0.8)	79.26 (0.7)	63.34 (3.1)	80.68 (1.8)	62.16 (2.2)
	Proposed	82.64 (1.1)	82.20 (0.9)	73.16 (1.0)	81.64 (1.6)	73.46 (3.2)
Citeseer	GCN	63.18 (0.6)	56.04 (2.0)	37.74 (0.7)	66.60 (0.9)	40.68 (2.7)
	DGI	69.24 (0.2)	63.20 (0.1)	42.14 (0.3)	69.24 (0.3)	45.14 (0.2)
	GREET	68.36 (0.9)	62.46 (1.2)	44.50 (1.7)	70.74 (1.5)	46.88 (1.0)
	JoCoR	69.06 (1.2)	57.18 (1.4)	38.64 (1.0)	69.54 (0.6)	46.82 (2.4)
	NRGNN	66.22 (1.2)	59.66 (1.5)	45.82 (3.2)	67.86 (1.8)	50.10 (2.1)
	MTS-GNN	71.94 (1.6)	69.74 (1.1)	53.64 (2.1)	72.34 (1.1)	59.84 (2.8)
	Proposed	72.14 (0.7)	70.80 (0.8)	58.18 (3.9)	72.90 (0.5)	63.36 (3.9)
DBLP	GCN	80.32 (0.2)	77.54 (1.1)	65.05 (0.6)	80.19 (0.5)	61.56 (1.2)
	DGI	79.86 (0.2)	79.28 (0.2)	67.59 (0.2)	78.99 (0.1)	70.29 (0.5)
	GREET	80.62 (0.3)	79.83 (0.5)	69.10 (1.0)	81.16 (0.2)	71.41 (1.8)
	JoCoR	80.77 (0.3)	78.52 (0.9)	67.32 (1.1)	80.26 (0.3)	64.71 (3.3)
	NRGNN	81.89 (0.4)	82.20 (0.3)	76.13 (1.6)	82.01 (0.4)	72.80 (1.2)
	MTS-GNN	82.76 (0.2)	82.49 (0.3)	78.82 (1.5)	82.68 (2.4)	70.35 (4.5)
	Proposed	83.93 (0.2)	83.87 (0.2)	81.18 (0.3)	83.56 (1.7)	77.32 (0.8)
Photo	GCN	88.15 (0.6)	84.51 (0.7)	73.27 (5.9)	88.50 (0.8)	66.20 (1.3)
	DGI	87.83 (0.4)	86.78 (0.5)	78.61 (0.9)	85.67 (0.1)	70.71 (0.2)
	GREET	88.77 (0.4)	86.39 (0.5)	74.71 (1.2)	88.06 (0.4)	65.66 (0.7)
	JoCoR	88.34 (1.1)	84.17 (0.8)	79.22 (2.1)	88.78 (0.5)	70.02 (2.9)
	NRGNN	88.30 (0.5)	86.17 (0.6)	77.25 (3.5)	88.61 (0.6)	66.20 (2.4)
	MTS-GNN	91.01 (1.4)	88.10 (1.2)	85.12 (1.7)	91.74 (0.3)	81.26 (4.6)
	Proposed	91.75 (0.1)	90.06 (1.0)	87.01 (0.9)	91.26 (0.3)	86.96 (2.3)
Computers	GCN	86.66 (0.3)	83.18 (0.5)	78.07 (0.4)	84.60 (0.1)	72.91 (0.8)
	DGI	83.12 (0.1)	77.71 (0.1)	75.89 (0.3)	81.73 (0.2)	77.71 (0.1)
	GREET	84.16 (0.2)	80.75 (0.8)	72.19 (1.2)	83.05 (0.1)	64.35 (0.9)
	JoCoR	86.96 (0.4)	83.46 (0.4)	77.99 (0.5)	84.92 (0.4)	73.93 (0.9)
	NRGNN	86.51 (0.5)	83.63 (0.5)	77.37 (0.8)	84.34 (0.5)	74.22 (1.1)
	MTS-GNN	86.43 (0.2)	84.29 (0.2)	80.65 (0.1)	86.36 (0.4)	76.55 (2.1)
	Proposed	85.81 (0.2)	84.89 (0.1)	80.85 (0.3)	86.04 (0.2)	82.38 (1.2)

4.1. Experimental Settings

4.1.1. Datasets

The benchmark datasets in our experiments include three citation datasets (i.e., DBLP Bojchevski and Günnemann (2017), Cora and Citeseer Yang et al. (2016)), and two business datasets (i.e., Computers and Photo Shchur et al. (2018)).

4.1.2. Comparison Methods

The comparison methods include two baseline methods (i.e., GCN Kipf and Welling (2016) and GAT Veličković et al. (2017)), four unsupervised graph representation learning method (i.e., DGI Veličković et al. (2018), GCA Zhu et al. (2021), SUGRL Mo et al. (2022), and GREET Liu et al. (2023b)), and three noisy label methods (i.e., JoCoR Wei et al. (2020), NRGNN Dai et al. (2021), and MTS-GNN Liu et al. (2023a)).

Table 2. Classification accuracy (average accuracy (%) and standard deviation) of the proposed BO-NNC with different components on all datasets at the highest noise rates for the two noise types. Note that the bold number represents the best results in the whole column.

Uniform (60%)
C1	C2	C3	Cora	Citeseer	DBLP	Photo	Computers
✓			55.22 (1.0)	44.74 (2.4)	77.57 (0.8)	83.74 (1.3)	79.54 (0.4)
		✓	72.50 (1.3)	53.24 (3.0)	80.54 (0.4)	86.23 (1.1)	80.45 (0.6)
✓	✓		58.56 (5.0)	45.28 (2.1)	76.67 (0.6)	84.04 (1.0)	79.30 (2.0)
✓		✓	71.26 (1.4)	56.78 (3.5)	81.25 (0.3)	86.10 (0.7)	80.36 (0.7)
✓	✓	✓	73.16 (1.0)	58.18 (3.9)	81.95 (0.3)	87.01 (0.9)	80.85 (0.3)

Pair (40%) C1 C2 C3 Cora Citeseer DBLP Photo Computers ✓ 67.14 (3.1) 51.56 (2.3) 76.05 (1.0) 78.70 (5.2) 79.42 (1.1) ✓ 70.52 (6.4) 63.56 (2.0) 74.33 (0.7) 81.08 (1.2) 81.09 (0.6) ✓ ✓ 65.76 (5.3) 55.62 (2.1) 75.22 (1.8) 79.39 (5.0) 76.43 (2.3) ✓ ✓ 73.36 (3.5) 62.08 (2.5) 76.01 (1.0) 84.95 (2.2) 81.35 (1.3) ✓ ✓ ✓ 77.50 (2.9) 63.36 (3.9) 77.32 (0.8) 86.96 (2.3) 82.38 (1.2)

Table 3. Classification accuracy (average accuracy (%) and standard deviation) of different multi-teacher distillation methods at 60% noise rates on all datasets. Note that, the bold number represents the best results in the whole column.

Uniform (60%)
	Cora	Citeseer	DBLP	Photo	Computers
Mean	71.26 (1.4)	56.78 (3.5)	81.25 (0.3)	86.10 (0.7)	80.36 (0.7)
Co-attention	71.82 (0.9)	57.24 (3.2)	81.35 (0.3)	85.66 (1.7)	80.78 (0.6)
Proposed	73.16 (1.0)	58.18 (3.9)	81.95 (0.3)	87.01 (0.9)	80.85 (0.3)

Pair (40%) Cora Citeseer DBLP Photo Computers Mean 73.36 (3.5) 62.08 (2.5) 76.01 (1.0) 84.95 (2.2) 81.35 (1.3) Co-attention 72.96 (0.4) 62.10 (2.7) 77.17 (0.9) 85.08 (1.9) 80.80 (1.6) Proposed 77.50 (2.9) 63.36 (3.9) 77.32 (0.8) 86.96 (2.3) 82.38 (1.2)

4.1.3. Setting-up

We follow the PyTorch Geometric library to split the datasets Cora and Citeseer, and randomly sample 5%, 10%, and 60% of the nodes, respectively, as labelled training set, validation set, and test set for the datasets DBLP, Photo, and Computers. There is no overlap among the training set, the validation set and the test set. We follow the literature Han et al. (2018); Jiang et al. (2018) to add different rates of noisy into the real datasets. Specifically, given a noise rate $p$ , we first generate noisy labels over all classes according to a noise transition matrix $\mathbf{Q}\in\mathbb{R}^{c\times c}$ , where $q_{ij}=\mathbf{p}(\overline{y}=j|y=i)$ is the probability of clean label $y$ being flipped to the noisy label $\overline{y}$ , and then corrupt the labels of label training set with two types of label noise:

$\bullet$

Uniform noise The label has $p$ probability of being randomly flipped to other class. Furthermore, we set noise rates of 20%, 40%, and 60% for the uniform noise.
$\bullet$

Pair noise Labels are assumed to make mistakes only within the most similar pair classes. More specifically, label has $p$ probability of being randomly flipped to their pair classes. Furthermore, we set noise rates of 20%, 40% for the pair noise.

To reduce randomness of experimental results, we conduct five experiments with different random seeds and report the average results and corresponding standard deviations. In our experiments, we train three encoders by employing GCA Zhu et al. (2021), DGI Veličković et al. (2018) and SUGRL Mo et al. (2022), plus single layer MLP for every encoder to obtain three teacher classifiers. For a fair comparison, the methods, including JoCoR, NRGNN, MTS-GNN, and the student model of our method use a two-layer GCN as the backbone. In addition, we obtain the source code of all comparison methods from the authors and set the parameters of all comparison methods according to the original literature so that they output the best performance on all datasets.

4.2. Result Analysis

We compare our proposed BO-NNC with all comparison methods on five datsets in terms of node classification tasks with different noise rates and report the results in Table 1.

First, the proposed BO-NNC consistently achieves the best results on all datasets, followed by MTS-GNN, NRGNN, DGI, GREET, JoCoR, GAT, GCA, SUGRL and GCN. The reason is that our BO-NNC enables the student model to obtain the knowledge from multiple teacher models, thus solving the limitations of the single model to effectively deal with noisy labels. Second, our BO-NNC outperforms two baseline methods (i.e., GCN and GAT) and four unsupervised graph representation learning method (i.e., GCA, DGI, SUGRL, and GREET) by a large margin, because the proposed method provides more correct supervised information for the model training by noisy label filtering and pseudo-label selection. For example, the proposed BO-NNC averagely improves by 2.93%, 4.99%, and 13.44% respectively, compared with the DGI that performs best in there six methods, in uniform noisy rates of 20%, 40%, and 60% on all datasets. Third, our MTS-GNN achieves promising improvements, compared with the methods that deal with the noisy label problem, i.e., JoCoR, NRGNN, and MTS-GNN. This indicates that our method fully leverages the complementary information among teacher models through bi-level optimization, and thus enabling the student model to achieve better generalization performance.

4.3. Ablation Study

Our method contains three important components, i.e., using the student model to integrate knowledge from multiple teacher models (C1 for short), multi-teacher knowledge distillation based on bi-level optimization (C2 for short), and label improvement (C3 for short). To demonstrate the effectiveness of each component, we report the node classification performance of different component combinations on all datasets at the highest noisy rates in Table 2.

First, our method with all components averagely improves by 5.01%, compared with the methods with one component only. This indicates that all three components are essential for our method, which verifies the feasibility of our proposed method. Second, the overall effectiveness of the method considering the student model is significantly superior to the one without considering the student model, because the student model can explore the complementary information from the teacher models. Last, bi-level optimization needs enough clean node to achieve effectiveness, while label improvement may provide enough clean data. Hence, it is reasonable to simultaneously consider bi-level optimization based multi-teacher distillation and label improvement for dealing with noisy labels.

4.4. Effectiveness of Multi-Teacher Construction

We conduct experiments to investigate the effectiveness of our constructed multi-teacher models on all datasets at the highest noisy rates (i.e., 60%) for two types of noise and report the results in Appendix C.

Obviously, the performance of the student model increases with the increase of the number of teacher models. Specifically, compared to the best performance of the methods with only one teacher model, the results of our method with three teacher models improves by an average of 2.68% and 2.49%, respectively, for uniform noise and pair noise on all datasets. This suggests that multi-teacher models provide diverse complementary information to the student models.

4.5. Effectiveness of Bi-level Optimization

We compare our method with the mean method and the co-attention method on all datasets at the highest noisy rates (i.e., 60%) for two noise types and report the results in Table 3, to verify the effectiveness of the proposed bi-level optimization strategy.

Obviously, our proposed method achieves the best results. Specifically, compared with the best comparison method, i.e., the co-attention method, our method improves by on average of 0.86% and 1.88%, respectively, for uniform noise and pair noise on all datasets. This fully demonstrates that our proposed method can more effectively capture the complementary information among teacher models, compared to all comparison methods.

4.6. Parameter Sensitivity Analysis

Our method has five key hyper-parameters, i.e., $\rho$ for pseudo-label selection, $r$ for noisy label selection, and $\beta_{1}$ , $\beta_{2}$ , and $\alpha$ for selecting the clean node set. We investigate the sensitivity of our method to these hyper-parameters at 60% noisy rate, and report the results in Figures 2 ~ 4.

Based on the experimental results, first, our method is not sensitive to the hyper-parameters, including $\beta_{1}$ , $\beta_{2}$ , and $\alpha$ , because our bi-level optimization process does not rely on a large number of correct labels. Second, our method is sensitive to the parameters $r$ and $\rho$ . Specifically, if the value of $r$ is less than 0.4, the performance of the model gradually increases with the increase of $r$ . If the value of $r$ is larger than 0.6, the performance of the model gradually decreases with the increase of $r$ . The reason is that too many correct labels were removed. , If the value of $\rho$ is less than 70, the variation of $\rho$ has little impact on the model performance. If the value of $\rho$ is larger than 70, the performance of the model gradually decreases with the increase of $\rho$ . The reason is that the incorrect pseudo-label will be inevitably selected.

5. Conclusion

In this paper, we proposed a new NNC method to address the limitations in previous methods. Specifically, multi-teacher distillations transfer diverse information to the learning of the student model. The bi-level optimization strategy fully explores the complementary information among multiple teacher models for the learning of the student model so that requiring less number of correct labels to achieve a robust student model. Experimental results showed that our method achieves supreme performance, compared to the state-of-the-art methods, in terms of noisy node classification tasks.

References

(1)
Bojchevski and Günnemann (2017) Aleksandar Bojchevski and Stephan Günnemann. 2017. Deep gaussian embedding of graphs: Unsupervised inductive learning via ranking. arXiv preprint arXiv:1707.03815 (2017).
Chen et al. (2022) Can Chen, Xi Chen, Chen Ma, Zixuan Liu, and Xue Liu. 2022. Gradient-based bi-level optimization for deep learning: A survey. arXiv preprint arXiv:2207.11719 (2022).
Dai et al. (2021) Enyan Dai, Charu Aggarwal, and Suhang Wang. 2021. Nrgnn: Learning a label noise resistant graph neural network on sparsely and noisily labeled graphs. In KDD. 227–236.
Dai and Wang (2021) Enyan Dai and Suhang Wang. 2021. Say no to the discrimination: Learning fair graph neural networks with limited sensitive attribute information. In WSDM. 680–688.
Franceschi et al. (2017) Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. 2017. Forward and reverse gradient-based hyperparameter optimization. In ICML. PMLR, 1165–1173.
Gao et al. (2021) Yan Gao, Titouan Parcollet, and Nicholas D Lane. 2021. Distilling knowledge from ensembles of acoustic models for joint ctc-attention end-to-end speech recognition. In ASRU. IEEE, 138–145.
Gui et al. (2021) Xian-Jin Gui, Wei Wang, and Zhang-Hao Tian. 2021. Towards understanding deep learning from noisy labels with small-loss criterion. arXiv preprint arXiv:2106.09291 (2021).
Guo et al. (2019) Shengnan Guo, Youfang Lin, Ning Feng, Chao Song, and Huaiyu Wan. 2019. Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In AAAI, Vol. 33. 922–929.
Han et al. (2018) Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. 2018. Co-teaching: Robust training of deep neural networks with extremely noisy labels. NeurIPS 31 (2018).
Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
Jiang et al. (2018) Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. 2018. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML. PMLR, 2304–2313.
Kim et al. (2021) Kyungyul Kim, ByeongMoon Ji, Doyoung Yoon, and Sangheum Hwang. 2021. Self-knowledge distillation with progressive refinement of targets. In ICCV. 6567–6576.
Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
Li et al. (2018) Qimai Li, Zhichao Han, and Xiao-Ming Wu. 2018. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, Vol. 32.
Li et al. (2021) Yayong Li, Jie Yin, and Ling Chen. 2021. Unified robust training for graph neural networks against label noise. In PAKDD. Springer, 528–540.
Liu et al. (2021a) Risheng Liu, Jiaxin Gao, Jin Zhang, Deyu Meng, and Zhouchen Lin. 2021a. Investigating bi-level optimization for learning and vision from a unified perspective: A survey and beyond. TPAMI 44, 12 (2021), 10045–10067.
Liu et al. (2021b) Risheng Liu, Xuan Liu, Xiaoming Yuan, Shangzhi Zeng, and Jin Zhang. 2021b. A value-function-based interior-point method for non-convex bi-level optimization. In ICML. PMLR, 6882–6892.
Liu et al. (2023a) Yujing Liu, Zongqian Wu, Zhengyu Lu, Guoqiu Wen, Junbo Ma, Guangquan Lu, and Xiaofeng Zhu. 2023a. Multi-teacher Self-training for Semi-supervised Node Classification with Noisy Labels. In ACM MM. 2946–2954.
Liu et al. (2023b) Yixin Liu, Yizhen Zheng, Daokun Zhang, Vincent CS Lee, and Shirui Pan. 2023b. Beyond smoothing: Unsupervised graph representation learning with edge heterophily discriminating. In AAAI, Vol. 37. 4516–4524.
Mo et al. (2022) Yujie Mo, Liang Peng, Jie Xu, Xiaoshuang Shi, and Xiaofeng Zhu. 2022. Simple unsupervised graph representation learning. In AAAI, Vol. 36. 7797–7805.
Shchur et al. (2018) Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. 2018. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868 (2018).
Veličković et al. (2017) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
Veličković et al. (2018) Petar Veličković, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. 2018. Deep graph infomax. arXiv preprint arXiv:1809.10341 (2018).
Wei et al. (2020) Hongxin Wei, Lei Feng, Xiangyu Chen, and Bo An. 2020. Combating noisy labels by agreement: A joint training method with co-regularization. In CVPR. 13726–13735.
Wu et al. (2022) Lirong Wu, Yufei Huang, Haitao Lin, Zicheng Liu, Tianyu Fan, and Stan Z Li. 2022. Automated Graph Self-supervised Learning via Multi-teacher Knowledge Distillation. arXiv preprint arXiv:2210.02099 (2022).
Yang et al. (2016) Zhilin Yang, William Cohen, and Ruslan Salakhudinov. 2016. Revisiting semi-supervised learning with graph embeddings. In ICML. PMLR, 40–48.
Zhu et al. (2021) Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. 2021. Graph contrastive learning with adaptive augmentation. In Proceedings of the Web Conference 2021. 2069–2080.

Noisy Node Classification by Bi-level Optimization based Multi-teacher Distillation