\settopmatter

printacmref=false

\orcid

0009-0005-9229-5188

Noisy Node Classification
by Bi-level Optimization based Multi-teacher Distillation

Yujing Liu \institutionGuangxi Key Lab of Multisource Information Mining and Security, Guangxi Normal University \cityGuilin \countryChina \postcode541004 liuyujingcn@gmail.com Zongqian Wu \institutionSchool of Computer Science and Engineering, University of Electronic Science and Technology of China \cityChengdu \countryChina \postcode611731 wkzongqianwu@gmail.com Zhengyu Lu \institutionGuangxi Key Lab of Multisource Information Mining and Security, Guangxi Normal University \cityGuilin \countryChina \postcode541004 airmanlu@foxmail.com Ci Nie \institutionGuangxi Key Lab of Multisource Information Mining and Security, Guangxi Normal University \cityGuilin \countryChina \postcode541004 nieci2024@gmail.com Guoqiu Wen \institutionGuangxi Key Lab of Multisource Information Mining and Security, Guangxi Normal University \cityGuilin \countryChina \postcode541004 wenguoqiu2008@163.com Ping Hu \institutionSchool of Computer Science and Engineering, University of Electronic Science and Technology of China \cityChengdu \countryChina \postcode610000 chinahuping@gmail.com  and  Xiaofeng Zhu \institutionGuangxi Key Lab of Multisource Information Mining and Security, Guangxi Normal University \cityGuilin \countryChina \postcode541004 seanzhuxf@gmail.com
Abstract.

Previous graph neural networks (GNNs) usually assume that the graph data is with clean labels for representation learning, but it is not true in real applications. In this paper, we propose a new multi-teacher distillation method based on bi-level optimization (namely BO-NNC), to conduct noisy node classification on the graph data. Specifically, we first employ multiple self-supervised learning methods to train diverse teacher models, and then aggregate their predictions through a teacher weight matrix. Furthermore, we design a new bi-level optimization strategy to dynamically adjust the teacher weight matrix based on the training progress of the student model. Finally, we design a label improvement module to improve the label quality. Extensive experimental results on real datasets show that our method achieves the best results compared to state-of-the-art methods. The source code is listed in Supplementary Materials.

Key words and phrases:
Graph convolutional networks, Noisy labels, Bi-level Optimization, Semi-supervision node Classification
{CCSXML}

<ccs2012> <concept> <concept_id>00000000.0000000.0000000</concept_id> <concept_desc>Computers systems organization, Embedded systems</concept_desc> <concept_significance>500</concept_significance> </concept> </ccs2012>

\ccsdesc

[500]Computing methodologies Neural networks

1. Introduction

Node classification usually relies on correct supervised information to predict unlabelled nodes on the graph data. However, real-world graph data often contain incorrect supervised information, i.e., noisy labels, which easily misleads the training model and in turn degrades the node classification performance. To address this issue, noisy node classification (NNC) has been widely proposed to train machine learning models (e.g., graph neural networks (GNNs)) on the graph data with noisy labels111Related Work is listed in Appendix A. Li et al. (2021), so NNC has garnered growing interest in practical applications Dai and Wang (2021); Guo et al. (2019).

A number of methods have been proposed to handle noisy labels. For example, NRGNN Dai et al. (2021) trains a pseudo-label miner to select pseudo-labels for augmenting the supervision information. UnionNET Li et al. (2021) first trains a GNN using the original labels and then performs label correction based on the node embeddings learned by the obtained GNN. However, these methods excessively depend on the performance of individual model, and their effectiveness can significantly decrease as the noise rate increases. For instance, in the case of high noise rates, NRGNN will select incorrect pseudo-labels and UnionNET will contaminate correct labels with label correction. One of the key reasons is that previous methods design a single model only to handle with noisy label to result in limited robustness.

To alleviate the above issue, the multi-teacher distillation method is proposed to transfer knowledge from multiple models to deal with noisy labels. For instance, MTS-GNN Liu et al. (2023a) uses models saved from earlier iterations to guide model training and label correction for later iterations. However, previous multi-teacher distillation methods still have limitations to be addressed. First, the constructed multi-teacher models across nearby iterations have few difference, and thus resulting in the lack of diversity. Second, previous methods do not consider the complementary information among teacher models and could not fully exploit the knowledge of teacher models.

Refer to caption
Figure 1. The framework of the proposed BO-NNC, consisting of three modules, i.e., Multi-teacher Construction, Multi-teacher Distillation, and Label Improvement. Specifically, Multi-teacher Construction employs multiple self-supervised learning methods to obtain diverse teacher models. Multi-teacher Distillation transfers the knowledge from the teacher models to the student model through multi-teacher distillation based on bi-level optimization. Specifically, the lower level makes the student model learn the knowledge of the teacher models from the soft label matrix, which is the Hadamard product between the teacher weight matrix and k𝑘kitalic_k prediction probability matrices produced by teacher models. The upper level updates the teacher weight matrix based on the training progress of the student model. Label Improvement uses both the student model and the teacher models to first detect noisy labels and then select pseudo-labels.

Addressing the above issues of multi-teacher distillation methods is challenging. First, since the labelled data is sparse and with noisy, it is difficult to train diverse teacher models by existing methods (e.g., randomly sampling training subsets from the training set). Second, previous methods usually require a large number of correct labels to learn the weights of the teacher models, which is not practical for NNC tasks. For example, MTS-GNN sets the weights of teacher models as hyper-parameters, resulting in expensive time cost and difficult to reasonably set the weights.

In this paper, we propose a new multi-teacher distillation method based on bi-level optimization (namely BO-NNC, shown in Figure 1) to address the above issues. To do this, we first employ diverse self-supervised graph methods (e.g., Veličković et al. (2018); Zhu et al. (2021); Mo et al. (2022)) to construct multiple teacher models, solving the first issue of previous methods, and then construct a teacher weight matrix to integrate the predictions of the teacher models as the soft label matrix, which is regarded as the ground truth of the student model. Moreover, we propose a new bi-level optimization strategy to iteratively train the teacher weight matrix and the student model. This allows the teacher weight matrix to be dynamically adjusted based on the training progress of the student model without requiring a large number of correct labels to learn the weights of the teacher models. In particular, the proposed bi-level optimization algorithm helps to learn the complementary information among teacher models. After the bi-level optimization, we further design a label improvement module to improve the label quality of the dataset, which further enhances the learning of multiple teacher models and the student model.

Compared with previous NNC methods, the main contributions of our proposed method are summarized as follows:

  • \bullet

    We propose a new method to deal with noisy node classification. Specifically, our proposed method solves the limitations of previous methods by multi-teacher distillations, and designs a label improvement module to gradually improve the quality of labels.

  • \bullet

    We design a new distillation method based on bi-level optimization, which automatically adjust the teacher weight matrix, thus more fully exploring the complementary information among multiple teacher models.

  • \bullet

    We conduct extensive experiments on five real datasets. Experimental results demonstrate that our proposed method achieves the best performance, compared to existing SOTA methods at different levels of noisy data.

2. Methodology

Denoting 𝒢=(𝒱,)𝒢𝒱\mathcal{G}=(\mathcal{V},\mathcal{E})caligraphic_G = ( caligraphic_V , caligraphic_E ) as a graph, where 𝒱={v1,,vn}𝒱subscript𝑣1subscript𝑣𝑛\mathcal{V}=\left\{v_{1},\cdots,v_{n}\right\}caligraphic_V = { italic_v start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , ⋯ , italic_v start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT } represents the node set with n𝑛nitalic_n nodes and 𝒱×𝒱𝒱𝒱\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}caligraphic_E ⊆ caligraphic_V × caligraphic_V represents the edge set, we denote 𝐗={𝐱1,,𝐱n}n×d𝐗subscript𝐱1subscript𝐱𝑛superscript𝑛𝑑\mathbf{X}=\left\{\mathbf{x}_{1},\cdots,\mathbf{x}_{n}\right\}\in\mathbb{R}^{n% \times d}bold_X = { bold_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , ⋯ , bold_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT } ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × italic_d end_POSTSUPERSCRIPT and 𝐀n×n𝐀superscript𝑛𝑛\mathbf{A}\in\mathbb{R}^{n\times n}bold_A ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × italic_n end_POSTSUPERSCRIPT, respectively, as the feature matrix of all nodes and the adjacency matrix of the graph 𝒢𝒢\mathcal{G}caligraphic_G, where 𝐱idsubscript𝐱𝑖superscript𝑑\mathbf{x}_{i}\in\mathbb{R}^{d}bold_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT is the d𝑑ditalic_d-dimensional feature vector of the i𝑖iitalic_i-th node. The labels in the datasets are denoted as 𝐘={𝐲1,,𝐲b}𝐘subscript𝐲1subscript𝐲𝑏\mathbf{Y}=\{\mathbf{y}_{1},\cdots,\mathbf{y}_{b}\}bold_Y = { bold_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , ⋯ , bold_y start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT }, where b𝑏bitalic_b is the number of labelled nodes, 𝐲icsubscript𝐲𝑖superscript𝑐\mathbf{y}_{i}\in\mathbb{R}^{c}bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT is one-hot encoding of the i𝑖iitalic_i-th label and c𝑐citalic_c is the number of classes.

2.1. Multi-teacher Construction

Existing studies Wu et al. (2022) demonstrate that different representation learning methods usually output diverse node representations. Motivated by such observation, we employ existing graph representation learning algorithms to construct multiple teacher models.

Specifically, we first train k𝑘kitalic_k encoders by k𝑘kitalic_k different unsupervised graph representation learning methods and then use them to obtain node embeddings:

𝐇i=𝐅i(𝐗,𝐀),subscript𝐇𝑖subscript𝐅𝑖𝐗𝐀\displaystyle\mathbf{H}_{i}=\mathbf{F}_{i}(\mathbf{X},\mathbf{A}),bold_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = bold_F start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_X , bold_A ) , (1)

where i[1,,k]𝑖1𝑘i\in[1,...,k]italic_i ∈ [ 1 , … , italic_k ] and 𝐅isubscript𝐅𝑖\mathbf{F}_{i}bold_F start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT denotes the i𝑖iitalic_i-th encoder. We then use individual embedding as inputs to train k𝑘kitalic_k classifiers by the loss function as follows:

i=j𝒮𝐲jlog𝐩ji,subscript𝑖subscript𝑗𝒮subscript𝐲𝑗superscriptsubscript𝐩𝑗𝑖\displaystyle\mathcal{L}_{i}=-\sum_{j\in\mathcal{S}}\mathbf{y}_{j}\log\mathbf{% p}_{j}^{i},caligraphic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = - ∑ start_POSTSUBSCRIPT italic_j ∈ caligraphic_S end_POSTSUBSCRIPT bold_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT roman_log bold_p start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , (2)

where 𝒮𝒮\mathcal{S}caligraphic_S denotes the labelled training set, 𝐩jicsuperscriptsubscript𝐩𝑗𝑖superscript𝑐\mathbf{p}_{j}^{i}\in\mathbb{R}^{c}bold_p start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT denotes the prediction of the i𝑖iitalic_i-th classifier for node vjsubscript𝑣𝑗v_{j}italic_v start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT, and 𝐲jsubscript𝐲𝑗\mathbf{y}_{j}bold_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT denotes the label of node vjsubscript𝑣𝑗v_{j}italic_v start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT. In the end, we obtain k𝑘kitalic_k teacher models consisting of k𝑘kitalic_k encoders and k𝑘kitalic_k classifiers to achieve model diversity.

The literature (e.g., Hinton et al. (2015); Kim et al. (2021)) demonstrate that the knowledge in the teacher model can be transferred to the student model by making the output of the student model mimic the output of the teacher model. Motivated by this, we consider using the predictions of the teacher models as soft labels to train the student model. This may enable the student model to explore the complementary information among teacher models, and thus obtaining better generalization performance. In this way, our method may successfully address the issue of the excessive dependence on the performance of a single model in previous methods.

2.2. Multi-teacher Distillation

After constructing multiple teacher models on labelled nodes, in this section, diverse knowledge will be transferred to the student model. To do this, we first employ multiple teachers to generate the node embeddings 𝐇={𝐇1,,𝐇k}𝐇subscript𝐇1subscript𝐇𝑘\mathbf{H}=\{\mathbf{H}_{1},...,\mathbf{H}_{k}\}bold_H = { bold_H start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , bold_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT } and prediction probability matrices 𝐏={𝐏1,,𝐏k}𝐏subscript𝐏1subscript𝐏𝑘\mathbf{P}=\{\mathbf{P}_{1},...,\mathbf{P}_{k}\}bold_P = { bold_P start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , bold_P start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT } for all nodes (including labelled nodes and unlabelled nodes), which are then used to guide the student model.

Second, the student model are generated by randomly initialized parameters on all nodes. That is, the node embedding of all nodes 𝐙(i)superscript𝐙𝑖\mathbf{Z}^{(i)}bold_Z start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT in the i𝑖iitalic_i-th layer of the student model and the probability matrix 𝐏Ssuperscript𝐏𝑆\mathbf{P}^{S}bold_P start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT of the student model are obtained by:

{𝐙(i)=σ(𝐀^𝐙(i1)𝐖(i1)),𝐏S=softmax(σ(𝐀^𝐙(i)𝐖(i))),casessuperscript𝐙𝑖𝜎^𝐀superscript𝐙𝑖1superscript𝐖𝑖1otherwisesuperscript𝐏𝑆softmax𝜎^𝐀superscript𝐙𝑖superscript𝐖𝑖otherwise\begin{cases}{}\mathbf{Z}^{(i)}=\sigma(\mathbf{\hat{A}}\mathbf{Z}^{(i-1)}% \mathbf{W}^{(i-1)}),\\ \mathbf{P}^{S}=\operatorname{softmax}(\sigma(\mathbf{\hat{A}}\mathbf{Z}^{(i)}% \mathbf{W}^{(i)})),\end{cases}{ start_ROW start_CELL bold_Z start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT = italic_σ ( over^ start_ARG bold_A end_ARG bold_Z start_POSTSUPERSCRIPT ( italic_i - 1 ) end_POSTSUPERSCRIPT bold_W start_POSTSUPERSCRIPT ( italic_i - 1 ) end_POSTSUPERSCRIPT ) , end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL bold_P start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT = roman_softmax ( italic_σ ( over^ start_ARG bold_A end_ARG bold_Z start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT bold_W start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ) ) , end_CELL start_CELL end_CELL end_ROW (3)

where 𝐖(i)superscript𝐖𝑖\mathbf{W}^{(i)}bold_W start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT is the trainable parameters of the i𝑖iitalic_i-th layer and σ𝜎\sigmaitalic_σ is the activation function.

After that, we fuse k𝑘kitalic_k prediction probability matrices to a probability matrix, i.e., soft label matrix, which is the ground truth of the student model. The simple strategy for mapping k𝑘kitalic_k prediction probability matrices to the soft label matrix is the mean method. However, such a method equivalently treats every teacher model. A good alternative is co-attention method Gao et al. (2021). Specifically, we learn the weight matrix 𝒲k×c𝒲superscript𝑘𝑐\mathcal{W}\in\mathbb{R}^{k\times c}caligraphic_W ∈ blackboard_R start_POSTSUPERSCRIPT italic_k × italic_c end_POSTSUPERSCRIPT (c𝑐citalic_c denotes the number of classes) to map k𝑘kitalic_k prediction probability matrices to the soft label matrix. However, previous co-attention methods usually require a large number of correct labels to learn the weights of the teacher models, which is not feasible in NNC tasks.

To address the above issue, we investigate clean nodes to effectively learn the weight matrix, resulting in a bi-level optimization. Specifically, after t𝑡titalic_t iterations of the student model, we calculate the loss among clean nodes to update the teacher weight matrix by the back-propagation (Details in Section 3). As a result, information obtained from the student model can be used to improve the teacher weight matrix. The updated teacher weight matrix guarantees the effectiveness of soft labels even with low-quality teacher models due to their limited number of labels.

In our method, clean nodes come from unlabelled nodes and labelled nodes. We select clean nodes from unlabelled nodes by first considering the prediction results of all teacher models and then filtering the selected clean nodes by node embedding similarity. Specifically, we first chose unlabelled nodes with consistent prediction among all teacher models to form the candidate set Can𝐶𝑎𝑛Canitalic_C italic_a italic_n:

Can={vij(1,k1),argmax𝐩ij=argmax𝐩ik},𝐶𝑎𝑛conditional-setsubscript𝑣𝑖formulae-sequencefor-all𝑗1𝑘1superscriptsubscript𝐩𝑖𝑗superscriptsubscript𝐩𝑖𝑘\displaystyle Can=\{v_{i}\mid\forall j\in(1,k-1),\,\arg\max\mathbf{p}_{i}^{j}=% \arg\max\mathbf{p}_{i}^{k}\},italic_C italic_a italic_n = { italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∣ ∀ italic_j ∈ ( 1 , italic_k - 1 ) , roman_arg roman_max bold_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT = roman_arg roman_max bold_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT } , (4)

where 𝐩ijsuperscriptsubscript𝐩𝑖𝑗\mathbf{p}_{i}^{j}bold_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT denotes the prediction of the j𝑗jitalic_j-th teacher model for node visubscript𝑣𝑖v_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. We then filter clean nodes in Can𝐶𝑎𝑛Canitalic_C italic_a italic_n based on the node embedding similarity between every node in Can𝐶𝑎𝑛Canitalic_C italic_a italic_n and its β1subscript𝛽1\beta_{1}italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT (β1subscript𝛽1\beta_{1}italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT is a hyper-parameter) closest nodes in the same class, i.e.,

ξi=j=1kvuCid(hij,huj),subscript𝜉𝑖superscriptsubscript𝑗1𝑘subscriptsubscript𝑣𝑢subscript𝐶𝑖𝑑subscriptsuperscript𝑗𝑖subscriptsuperscript𝑗𝑢\displaystyle\xi_{i}=\sum_{j=1}^{k}\sum_{v_{u}\in C_{i}}d(h^{j}_{i},h^{j}_{u}),italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_v start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT ∈ italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_d ( italic_h start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_h start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT ) , (5)

where Cisubscript𝐶𝑖C_{i}italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT denotes the set consisting of β1subscript𝛽1\beta_{1}italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT closest nodes of the same class to node visubscript𝑣𝑖v_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, hijsubscriptsuperscript𝑗𝑖h^{j}_{i}italic_h start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT denotes the embedding vector of the j𝑗jitalic_j-th teacher for node visubscript𝑣𝑖v_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, and d(,)𝑑d(\cdot,\cdot)italic_d ( ⋅ , ⋅ ) denotes Euclidean distance. Finally, we select β2subscript𝛽2\beta_{2}italic_β start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT (β2subscript𝛽2\beta_{2}italic_β start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT is a hyper-parameter) nodes with the smallest ξ𝜉\xiitalic_ξ (i.e., with the highest confidence) in each class as the first part of clean nodes.

In order to increase the number of clean nodes, we consider selecting the second part of clean nodes from labelled nodes, based on the loss values of the teacher models. Specifically, we first use the teacher models to calculate loss values for all labelled nodes:

liT=j=1k𝐲ilog𝐩ij,superscriptsubscript𝑙𝑖𝑇superscriptsubscript𝑗1𝑘subscript𝐲𝑖superscriptsubscript𝐩𝑖𝑗\displaystyle{l}_{i}^{T}=-\sum_{j=1}^{k}\mathbf{y}_{i}\log\mathbf{p}_{i}^{j},italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT = - ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_log bold_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT , (6)

where liTsuperscriptsubscript𝑙𝑖𝑇{l}_{i}^{T}italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT denotes the loss value of node visubscript𝑣𝑖v_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. After that, we select α%percent𝛼\alpha\%italic_α % (α𝛼\alphaitalic_α is a hyper-parameter) nodes with the lowest loss values (i.e., with the highest confidence) in each class to form the second part of clean nodes.

Finally, we combine two parts of clean nodes (i.e., with the highest confidence out of all nodes) to obtain the clean node set Cle𝐶𝑙𝑒Cleitalic_C italic_l italic_e, which is used to calculate the loss of the upper level, in the bi-level optimization. Moreover, the loss in the upper level is then used to update the teacher weight matrix. In the lower level, the teacher weight matrix is used to generate the soft label matrix, which is further used to update the parameters of the student model. In particular, the bi-level optimization strategy makes the teacher weight matrix be dynamically adjusted based on the feedback from the student model. This allows the updated teacher weight matrix to effectively fuse multiple prediction matrices, i.e., extracting their common information across all models and complementary information in individual models.

2.3. Label Improvement

After conducting both multi-teacher construction and multi-teacher distillation, we design a label improvement module to gradually improve the label quality of the dataset during the training process. This module involves two phases, i.e., noisy label filtering and pseudo-label selection.

2.3.1. Noisy label Filtering

Noisy label filtering is designed to filter a portion of noisy labels from the labelled data, thereby reducing incorrect supervision information in the dataset. Existing literature Gui et al. (2021) shows that the node with higher loss has higher probability to be a noisy node than the node with lower loss. Moreover, the loss in one class is different from that in other classes. Hence, we calculate the loss in the j𝑗jitalic_j-th class between the node labels and the predictions of the student model by:

Lj={liS𝐲i=j},whereliS=𝐲ilog𝐩iS,formulae-sequencesubscript𝐿𝑗conditional-setsuperscriptsubscript𝑙𝑖𝑆subscript𝐲𝑖𝑗wheresuperscriptsubscript𝑙𝑖𝑆subscript𝐲𝑖superscriptsubscript𝐩𝑖𝑆\displaystyle L_{j}=\{{l}_{i}^{S}\mid\mathbf{y}_{i}=j\},\leavevmode\nobreak\ % \text{where}\leavevmode\nobreak\ {l}_{i}^{S}=-\mathbf{y}_{i}\log\mathbf{p}_{i}% ^{S},italic_L start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = { italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT ∣ bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_j } , where italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT = - bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_log bold_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT , (7)

where j(1,c)𝑗1𝑐j\in(1,c)italic_j ∈ ( 1 , italic_c ) denotes the j𝑗jitalic_j-th class. We then set a threshold 𝝋𝝋\bm{\varphi}bold_italic_φ for each class to select noisy nodes:

𝝋i=sort(Li)δir,subscript𝝋𝑖𝑠𝑜𝑟𝑡subscriptsubscript𝐿𝑖subscript𝛿𝑖𝑟\displaystyle\bm{\varphi}_{i}=sort(L_{i})_{\lceil{\delta}_{i}r\rceil},bold_italic_φ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_s italic_o italic_r italic_t ( italic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT ⌈ italic_δ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_r ⌉ end_POSTSUBSCRIPT , (8)

where δisubscript𝛿𝑖{\delta}_{i}italic_δ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT denotes the number of labelled nodes of class i𝑖iitalic_i, r𝑟ritalic_r is a hyper-parameter that denotes the percentage of selected noisy nodes in the total labelled nodes. Finally, nodes with loss values larger than threshold in each class are regarded as noisy nodes and they are regarded as unlabelled nodes in the following part of the label improvement module.

2.3.2. Pseudo-label Selection

After removing noisy labels, pseudo-label selection investigates to assign pseudo-labels to unlabelled nodes, aim at increasing supervision information. We follow the literature Li et al. (2018) to separately select pseudo-labels based on the prediction confidence of both the student model and the teacher models. Specifically, we first use the student model to predict all unlabelled nodes and then record the confidence of the student model by:

δiS=max𝐩iS,subscriptsuperscript𝛿𝑆𝑖superscriptsubscript𝐩𝑖𝑆\delta^{S}_{i}=\max\mathbf{p}_{i}^{S},italic_δ start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = roman_max bold_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT , (9)

where δiSsubscriptsuperscript𝛿𝑆𝑖\delta^{S}_{i}italic_δ start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT denotes the prediction confidence of the student model for node visubscript𝑣𝑖v_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Given the prediction confidence, we select the top ρ𝜌\rhoitalic_ρ (ρ𝜌\rhoitalic_ρ is a hyper-parameter) nodes with the highest δSsuperscript𝛿𝑆\delta^{S}italic_δ start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT in each class as the first part of pseudo-labelled nodes.

We then sum the predictions of the teacher models as teacher predictions and calculate the confidence of the teacher models by:

δiT=max(j=1k𝐩ij),subscriptsuperscript𝛿𝑇𝑖superscriptsubscript𝑗1𝑘subscriptsuperscript𝐩𝑗𝑖\delta^{T}_{i}=\max(\sum_{j=1}^{k}\mathbf{p}^{j}_{i}),italic_δ start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = roman_max ( ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT bold_p start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) , (10)

where δiTsubscriptsuperscript𝛿𝑇𝑖\delta^{T}_{i}italic_δ start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT denotes the prediction confidence of the teacher models for node visubscript𝑣𝑖v_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Similarly, we select the top ρ𝜌\rhoitalic_ρ nodes with the highest δTsuperscript𝛿𝑇\delta^{T}italic_δ start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT in each class as the second part of pseudo-labelled nodes.

Finally, we assign all pseudo-labelled nodes with pseudo-labels and further add them to the labelled set, which will be used to adaptively update both the multi-teacher construction module and the multi-teacher distillation module, whose robustness are improved on the data with corrected labels and pseudo-labels with high quality.

3. Bi-level Optimization

3.1. Optimization Process

Previous works often separately the teacher weight matrix and the student model in Section 2.2. This ignores the potentiality of the student model to provide information for improving the teacher weight matrix. In this paper, we propose to dynamically adjust the teacher weight matrix based on the training progress of the student model, aiming at reducing the dependence of the learning the teacher weight matrix on correct labels. As a result, the optimization of both the teacher weight matrix and the student model results in a bi-level optimization problem, Liu et al. (2021a); Chen et al. (2022). Considering that the bi-level optimization usually includes two levels of optimization tasks, where the solution of the upper optimization task depends on the results of the lower optimization task. To address the above issue, in this paper, we treat the optimization of the teacher weight matrix as the upper level task and the training of the student model as the lower level task, to have the following objective function:

argminθw𝚪up(θw,θst)upper level tasks.t.θst=argminθs𝚪low(θw,θs)lower level task,\displaystyle\underbrace{\underset{\theta_{w}}{\arg\min}\ \mathbf{\Gamma}_{% \scriptscriptstyle\text{up}}(\theta_{w},\theta_{s}^{t})}_{\text{upper level % task}}\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ s.t.% \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \theta_{s}^{t}=% \underbrace{\underset{\theta_{s}}{\arg\min}\ \mathbf{\Gamma}_{% \scriptscriptstyle\text{low}}(\theta_{w},\theta_{s})}_{\text{lower level task}},under⏟ start_ARG start_UNDERACCENT italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_UNDERACCENT start_ARG roman_arg roman_min end_ARG bold_Γ start_POSTSUBSCRIPT up end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) end_ARG start_POSTSUBSCRIPT upper level task end_POSTSUBSCRIPT italic_s . italic_t . italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT = under⏟ start_ARG start_UNDERACCENT italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT end_UNDERACCENT start_ARG roman_arg roman_min end_ARG bold_Γ start_POSTSUBSCRIPT low end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ) end_ARG start_POSTSUBSCRIPT lower level task end_POSTSUBSCRIPT , (11)

where 𝚪upsubscript𝚪up\mathbf{\Gamma}_{\scriptscriptstyle\text{up}}bold_Γ start_POSTSUBSCRIPT up end_POSTSUBSCRIPT and 𝚪lowsubscript𝚪low\mathbf{\Gamma}_{\scriptscriptstyle\text{low}}bold_Γ start_POSTSUBSCRIPT low end_POSTSUBSCRIPT, respectively, denote the loss functions for the teacher weight matrix and the student model. θwsubscript𝜃𝑤\theta_{w}italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT and θssubscript𝜃𝑠\theta_{s}italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT, respectively, denote the trainable parameters of the teacher weight matrix 𝒲𝒲\mathcal{W}caligraphic_W and the student model. θstsuperscriptsubscript𝜃𝑠𝑡\theta_{s}^{t}italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT denotes the parameters of the student model in the t𝑡titalic_t-th iteration.

In the lower-level optimization, we train the student model using soft labels. Specifically, we first construct a teacher weight matrix 𝒲k×c𝒲superscript𝑘𝑐\mathcal{W}\in\mathbb{R}^{k\times c}caligraphic_W ∈ blackboard_R start_POSTSUPERSCRIPT italic_k × italic_c end_POSTSUPERSCRIPT to fuse the predictions of k𝑘kitalic_k teacher models as a soft label matrix, i.e.,

𝐲~i=softmax(j=1k𝒲j𝐩ij),subscript~𝐲𝑖softmaxsuperscriptsubscript𝑗1𝑘direct-productsubscript𝒲𝑗superscriptsubscript𝐩𝑖𝑗\displaystyle\widetilde{\mathbf{y}}_{i}=\operatorname{softmax}(\sum_{j=1}^{k}% \mathcal{W}_{j}\odot\mathbf{p}_{i}^{j}),over~ start_ARG bold_y end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = roman_softmax ( ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT caligraphic_W start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ⊙ bold_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) , (12)

where “direct-product\odot" denotes the Hadamard product. We then calculate the loss of the lower level optimization by:

𝚪low(θw,θs)=s,wheres=i𝒯𝐲~ilog𝐩iS,formulae-sequencesubscript𝚪lowsubscript𝜃𝑤subscript𝜃𝑠subscript𝑠wheresubscript𝑠subscript𝑖𝒯subscript~𝐲𝑖superscriptsubscript𝐩𝑖𝑆\displaystyle\mathbf{\Gamma}_{\scriptscriptstyle\text{low}}(\theta_{w},\theta_% {s})=\mathcal{L}_{s},\leavevmode\nobreak\ \text{where}\leavevmode\nobreak\ % \mathcal{L}_{s}=-\sum_{i\in\mathcal{T}}\widetilde{\mathbf{y}}_{i}\log\mathbf{p% }_{i}^{S},bold_Γ start_POSTSUBSCRIPT low end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ) = caligraphic_L start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT , where caligraphic_L start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT = - ∑ start_POSTSUBSCRIPT italic_i ∈ caligraphic_T end_POSTSUBSCRIPT over~ start_ARG bold_y end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_log bold_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT , (13)

where 𝒯𝒯\mathcal{T}caligraphic_T denotes the training set, and 𝐩iSsuperscriptsubscript𝐩𝑖𝑆\mathbf{p}_{i}^{S}bold_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT denote the prediction of the student model for node visubscript𝑣𝑖v_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT.

After calculating 𝚪lowsubscript𝚪low\mathbf{\Gamma}_{\scriptscriptstyle\text{low}}bold_Γ start_POSTSUBSCRIPT low end_POSTSUBSCRIPT, we update the parameters θssubscript𝜃𝑠\theta_{s}italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT of the student model by the gradient descent:

θsμ=θsμ1ημdΓlow(θsμ1;θw),superscriptsubscript𝜃𝑠𝜇superscriptsubscript𝜃𝑠𝜇1subscript𝜂𝜇subscriptdsubscriptΓlowsuperscriptsubscript𝜃𝑠𝜇1subscript𝜃𝑤\theta_{s}^{\mu}=\theta_{s}^{\mu-1}-\eta_{\mu}\scalebox{1.3}{d}_{{\Gamma}_{% \scriptscriptstyle\text{low}}}(\theta_{s}^{\mu-1};\theta_{w}),italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_μ end_POSTSUPERSCRIPT = italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_μ - 1 end_POSTSUPERSCRIPT - italic_η start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT d start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT low end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_μ - 1 end_POSTSUPERSCRIPT ; italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ) , (14)

where ημsubscript𝜂𝜇\eta_{\mu}italic_η start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT denotes the learning rate and dΓlow(θsμ1;θw)subscriptdsubscriptΓlowsuperscriptsubscript𝜃𝑠𝜇1subscript𝜃𝑤\scalebox{1.3}{d}_{{\Gamma}_{\scriptscriptstyle\text{low}}}(\theta_{s}^{\mu-1}% ;\theta_{w})d start_POSTSUBSCRIPT roman_Γ start_POSTSUBSCRIPT low end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_μ - 1 end_POSTSUPERSCRIPT ; italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ) denotes the derivative of ssubscript𝑠\mathcal{L}_{s}caligraphic_L start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT with respect to θsμ1superscriptsubscript𝜃𝑠𝜇1\theta_{s}^{\mu-1}italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_μ - 1 end_POSTSUPERSCRIPT. As a result, in the lower level optimization, we transfer the knowledge from the teacher models to the student model by having the prediction of the student model mimic the soft labels.

After the lower level optimization, we first pass the result θstsuperscriptsubscript𝜃𝑠𝑡\theta_{s}^{t}italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT of the lower level optimization to the upper level optimization, and then conduct the upper-level optimization. To do this, we update the teacher weight matrix using both the student model (after t𝑡titalic_t iterations) and clean nodes. Specifically, the loss function of the upper level optimization is designed as the cross-entropy loss between the prediction of the clean nodes in the student model and the pseudo-labels of clean nodes:

𝚪up(θw,θst)=w,wherew=iCle𝐲^ilog𝐩ist,formulae-sequencesubscript𝚪upsubscript𝜃𝑤superscriptsubscript𝜃𝑠𝑡subscript𝑤wheresubscript𝑤subscript𝑖𝐶𝑙𝑒subscript^𝐲𝑖subscriptsuperscript𝐩subscript𝑠𝑡𝑖\displaystyle\mathbf{\Gamma}_{\scriptscriptstyle\text{up}}(\theta_{w},\theta_{% s}^{t})=\mathcal{L}_{w},\leavevmode\nobreak\ \text{where}\leavevmode\nobreak\ % \mathcal{L}_{w}=-\sum_{i\in Cle}\hat{\mathbf{y}}_{i}\log\mathbf{p}^{s_{t}}_{i},bold_Γ start_POSTSUBSCRIPT up end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) = caligraphic_L start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT , where caligraphic_L start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT = - ∑ start_POSTSUBSCRIPT italic_i ∈ italic_C italic_l italic_e end_POSTSUBSCRIPT over^ start_ARG bold_y end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_log bold_p start_POSTSUPERSCRIPT italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , (15)

where Cle𝐶𝑙𝑒Cleitalic_C italic_l italic_e denotes the clean node set, 𝐲^isubscript^𝐲𝑖\hat{\mathbf{y}}_{i}over^ start_ARG bold_y end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT denotes the pseudo-label of visubscript𝑣𝑖v_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, and 𝐩istsubscriptsuperscript𝐩subscript𝑠𝑡𝑖\mathbf{p}^{s_{t}}_{i}bold_p start_POSTSUPERSCRIPT italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT denotes the prediction of the student model (after t𝑡titalic_t iterations) for node visubscript𝑣𝑖v_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT.

Obviously, θwsubscript𝜃𝑤\theta_{w}italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT is not directly involved in the calculation of wsubscript𝑤\mathcal{L}_{w}caligraphic_L start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT, thus we cannot follow the ordinary approach to directly calculate the gradient of wsubscript𝑤\mathcal{L}_{w}caligraphic_L start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT with respect to θwsubscript𝜃𝑤\theta_{w}italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT and update θwsubscript𝜃𝑤\theta_{w}italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT. To address this issue, we follow the literature Liu et al. (2021a, b) to perform derivations based on specific situation.

Specifically, we denote the outer optimization objective as minθw𝝋(θw):=𝚪up(θw,θst)assignsubscript𝜃𝑤𝝋subscript𝜃𝑤subscript𝚪upsubscript𝜃𝑤superscriptsubscript𝜃𝑠𝑡\underset{\theta_{w}}{\min}\ \bm{\varphi}(\theta_{w}):=\mathbf{\Gamma}_{% \scriptscriptstyle\text{up}}(\theta_{w},\theta_{s}^{t})start_UNDERACCENT italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_UNDERACCENT start_ARG roman_min end_ARG bold_italic_φ ( italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ) := bold_Γ start_POSTSUBSCRIPT up end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ), where θstsuperscriptsubscript𝜃𝑠𝑡\theta_{s}^{t}italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT is a variable dependent on θwsubscript𝜃𝑤\theta_{w}italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT, so we have:

𝝋(θw)θw=𝚪up(θw,θst)θwdirect grad+𝚪up(θw,θst)θst(θstθw)indirect grad.𝝋subscript𝜃𝑤subscript𝜃𝑤subscriptsubscript𝚪upsubscript𝜃𝑤superscriptsubscript𝜃𝑠𝑡subscript𝜃𝑤direct gradsubscriptsubscript𝚪upsubscript𝜃𝑤superscriptsubscript𝜃𝑠𝑡superscriptsubscript𝜃𝑠𝑡superscriptsubscript𝜃𝑠𝑡subscript𝜃𝑤indirect grad\displaystyle\frac{\partial{\bm{\varphi}(\theta_{w})}}{\partial{\theta_{w}}}=% \underbrace{\frac{\partial{\mathbf{\Gamma}_{\scriptscriptstyle\text{up}}(% \theta_{w},\theta_{s}^{t})}}{\partial{\theta_{w}}}}_{\text{direct grad}}+% \underbrace{\frac{\partial{\mathbf{\Gamma}_{\scriptscriptstyle\text{up}}(% \theta_{w},\theta_{s}^{t})}}{\partial{\theta_{s}^{t}}}(\frac{\partial{\theta_{% s}^{t}}}{\partial{\theta_{w}}})}_{\text{indirect grad}}.divide start_ARG ∂ bold_italic_φ ( italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ) end_ARG start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_ARG = under⏟ start_ARG divide start_ARG ∂ bold_Γ start_POSTSUBSCRIPT up end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) end_ARG start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_ARG end_ARG start_POSTSUBSCRIPT direct grad end_POSTSUBSCRIPT + under⏟ start_ARG divide start_ARG ∂ bold_Γ start_POSTSUBSCRIPT up end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) end_ARG start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_ARG ( divide start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_ARG start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_ARG ) end_ARG start_POSTSUBSCRIPT indirect grad end_POSTSUBSCRIPT . (16)

For the Eq.16, on the one hand, θwsubscript𝜃𝑤\theta_{w}italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT is not directly involved in the calculation of wsubscript𝑤\mathcal{L}_{w}caligraphic_L start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT, thus the “direct grad" is zero. On the other hand, the “indirect grad" consists of two parts, i.e., 𝚪up(θw,θst)θstsubscript𝚪upsubscript𝜃𝑤superscriptsubscript𝜃𝑠𝑡superscriptsubscript𝜃𝑠𝑡\frac{\partial{\mathbf{\Gamma}_{\scriptscriptstyle\text{up}}(\theta_{w},\theta% _{s}^{t})}}{\partial{\theta_{s}^{t}}}divide start_ARG ∂ bold_Γ start_POSTSUBSCRIPT up end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) end_ARG start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_ARG and θstθwsuperscriptsubscript𝜃𝑠𝑡subscript𝜃𝑤\frac{\partial{\theta_{s}^{t}}}{\partial{\theta_{w}}}divide start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_ARG start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_ARG. Since the former can be directly calculated by the derivation of wsubscript𝑤\mathcal{L}_{w}caligraphic_L start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT with respect to θstsuperscriptsubscript𝜃𝑠𝑡\theta_{s}^{t}italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT, the main challenging lies in the calculation both θstθwsuperscriptsubscript𝜃𝑠𝑡subscript𝜃𝑤\frac{\partial{\theta_{s}^{t}}}{\partial{\theta_{w}}}divide start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_ARG start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_ARG and 𝝋(θw)θw𝝋subscript𝜃𝑤subscript𝜃𝑤\frac{\partial{\bm{\varphi}(\theta_{w})}}{\partial{\theta_{w}}}divide start_ARG ∂ bold_italic_φ ( italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ) end_ARG start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_ARG.

In this paper, we first dynamically unfold θstsuperscriptsubscript𝜃𝑠𝑡\theta_{s}^{t}italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT based on Eq.14 to calculate θstθwsuperscriptsubscript𝜃𝑠𝑡subscript𝜃𝑤\frac{\partial{\theta_{s}^{t}}}{\partial{\theta_{w}}}divide start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_ARG start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_ARG by:

θst=𝝍t(θst1;θw),wheresuperscriptsubscript𝜃𝑠𝑡subscript𝝍𝑡superscriptsubscript𝜃𝑠𝑡1subscript𝜃𝑤where\displaystyle\theta_{s}^{t}=\bm{\psi}_{t}(\theta_{s}^{t-1};\theta_{w}),% \leavevmode\nobreak\ \leavevmode\nobreak\ \text{where}\leavevmode\nobreak\ % \leavevmode\nobreak\ italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT = bold_italic_ψ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT ; italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ) , where (17)
𝝍i(θsi1;θw)=θsi1ηid𝚪low(θsi1;θw),i=1,,t.formulae-sequencesubscript𝝍𝑖superscriptsubscript𝜃𝑠𝑖1subscript𝜃𝑤superscriptsubscript𝜃𝑠𝑖1subscript𝜂𝑖subscriptdsubscript𝚪lowsuperscriptsubscript𝜃𝑠𝑖1subscript𝜃𝑤𝑖1𝑡\displaystyle\bm{\psi}_{i}(\theta_{s}^{i-1};\theta_{w})=\theta_{s}^{i-1}-\eta_% {i}\scalebox{1.3}{d}_{\mathbf{\Gamma}_{\scriptscriptstyle\text{low}}}(\theta_{% s}^{i-1};\theta_{w}),\leavevmode\nobreak\ i=1,\cdots,t.bold_italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i - 1 end_POSTSUPERSCRIPT ; italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ) = italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i - 1 end_POSTSUPERSCRIPT - italic_η start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT d start_POSTSUBSCRIPT bold_Γ start_POSTSUBSCRIPT low end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i - 1 end_POSTSUPERSCRIPT ; italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ) , italic_i = 1 , ⋯ , italic_t .

Combining θstθwsuperscriptsubscript𝜃𝑠𝑡subscript𝜃𝑤\frac{\partial{\theta_{s}^{t}}}{\partial{\theta_{w}}}divide start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_ARG start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_ARG and Eq.17, we have:

θstθw=𝝍t(θst1;θw)θst1θst1θw+𝝍t(θst1;θw)θw.superscriptsubscript𝜃𝑠𝑡subscript𝜃𝑤subscript𝝍𝑡superscriptsubscript𝜃𝑠𝑡1subscript𝜃𝑤superscriptsubscript𝜃𝑠𝑡1superscriptsubscript𝜃𝑠𝑡1subscript𝜃𝑤subscript𝝍𝑡superscriptsubscript𝜃𝑠𝑡1subscript𝜃𝑤subscript𝜃𝑤\displaystyle\frac{\partial{\theta_{s}^{t}}}{\partial{\theta_{w}}}=\frac{% \partial{\bm{\psi}_{t}(\theta_{s}^{t-1};\theta_{w})}}{\partial{\theta_{s}^{t-1% }}}\frac{\partial{\theta_{s}^{t-1}}}{\partial{\theta_{w}}}+\frac{\partial{\bm{% \psi}_{t}(\theta_{s}^{t-1};\theta_{w})}}{\partial{\theta_{w}}}.divide start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_ARG start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_ARG = divide start_ARG ∂ bold_italic_ψ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT ; italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ) end_ARG start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT end_ARG divide start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT end_ARG start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_ARG + divide start_ARG ∂ bold_italic_ψ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT ; italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ) end_ARG start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_ARG . (18)

To simplify the notation, we denote:

𝐙t=θstθw,𝐀t=𝝍t(θst1;θw)θst1,𝐁t=𝝍t(θst1;θw)θw.formulae-sequencesubscript𝐙𝑡superscriptsubscript𝜃𝑠𝑡subscript𝜃𝑤formulae-sequencesubscript𝐀𝑡subscript𝝍𝑡superscriptsubscript𝜃𝑠𝑡1subscript𝜃𝑤superscriptsubscript𝜃𝑠𝑡1subscript𝐁𝑡subscript𝝍𝑡superscriptsubscript𝜃𝑠𝑡1subscript𝜃𝑤subscript𝜃𝑤\mathbf{Z}_{t}=\frac{\partial{\theta_{s}^{t}}}{\partial{\theta_{w}}},\ \mathbf% {A}_{t}=\frac{\partial{\bm{\psi}_{t}(\theta_{s}^{t-1};\theta_{w})}}{\partial{% \theta_{s}^{t-1}}},\ \mathbf{B}_{t}=\frac{\partial{\bm{\psi}_{t}(\theta_{s}^{t% -1};\theta_{w})}}{\partial{\theta_{w}}}.bold_Z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = divide start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_ARG start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_ARG , bold_A start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = divide start_ARG ∂ bold_italic_ψ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT ; italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ) end_ARG start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT end_ARG , bold_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = divide start_ARG ∂ bold_italic_ψ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT ; italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ) end_ARG start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_ARG . (19)

We then rewrite Eq.18 to:

θstθw=𝐙t=i=0t(j=0t𝐀j)𝐁i,superscriptsubscript𝜃𝑠𝑡subscript𝜃𝑤subscript𝐙𝑡superscriptsubscript𝑖0𝑡superscriptsubscriptproduct𝑗0𝑡subscript𝐀𝑗subscript𝐁𝑖\displaystyle\frac{\partial{\theta_{s}^{t}}}{\partial{\theta_{w}}}=\mathbf{Z}_% {t}=\sum\limits_{i=0}^{t}{(\prod\limits_{j=0}^{t}\mathbf{A}_{j})\mathbf{B}_{i}},divide start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_ARG start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_ARG = bold_Z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_i = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( ∏ start_POSTSUBSCRIPT italic_j = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT bold_A start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) bold_B start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , (20)

where 𝐀0=𝐁0=0subscript𝐀0subscript𝐁00\mathbf{A}_{0}=\mathbf{B}_{0}=0bold_A start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = bold_B start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 0. 𝐀isubscript𝐀𝑖\mathbf{A}_{i}bold_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and 𝐁isubscript𝐁𝑖\mathbf{B}_{i}bold_B start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT (i0𝑖0i\neq 0italic_i ≠ 0), respectively, can be obtained by calculating the derivative of d𝚪low(θsi1;θw)subscriptdsubscript𝚪lowsuperscriptsubscript𝜃𝑠𝑖1subscript𝜃𝑤\scalebox{1.3}{d}_{\mathbf{\Gamma}_{\scriptscriptstyle\text{low}}}(\theta_{s}^% {i-1};\theta_{w})d start_POSTSUBSCRIPT bold_Γ start_POSTSUBSCRIPT low end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i - 1 end_POSTSUPERSCRIPT ; italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ) with respect to θsi1superscriptsubscript𝜃𝑠𝑖1\theta_{s}^{i-1}italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i - 1 end_POSTSUPERSCRIPT and θwsubscript𝜃𝑤\theta_{w}italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT.

Based on the above derivation, it is obviously that we can iteratively calculate θstθwsuperscriptsubscript𝜃𝑠𝑡subscript𝜃𝑤\frac{\partial{\theta_{s}^{t}}}{\partial{\theta_{w}}}divide start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_ARG start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_ARG during the optimization of the lower level optimization. Finally, we calculate 𝝋(θw)θw=𝚪up(θw,θst)θst(θstθw)𝝋subscript𝜃𝑤subscript𝜃𝑤subscript𝚪upsubscript𝜃𝑤superscriptsubscript𝜃𝑠𝑡superscriptsubscript𝜃𝑠𝑡superscriptsubscript𝜃𝑠𝑡subscript𝜃𝑤\frac{\partial{\bm{\varphi}(\theta_{w})}}{\partial{\theta_{w}}}=\frac{\partial% {\mathbf{\Gamma}_{\scriptscriptstyle\text{up}}(\theta_{w},\theta_{s}^{t})}}{% \partial{\theta_{s}^{t}}}(\frac{\partial{\theta_{s}^{t}}}{\partial{\theta_{w}}})divide start_ARG ∂ bold_italic_φ ( italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ) end_ARG start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_ARG = divide start_ARG ∂ bold_Γ start_POSTSUBSCRIPT up end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) end_ARG start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_ARG ( divide start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_ARG start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_ARG ) and update the parameters θwsubscript𝜃𝑤\theta_{w}italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT of the teacher weight matrix by the gradient descent method:

θwλ=θwλ1ηλd𝝋(θw)(θst;θwλ1),superscriptsubscript𝜃𝑤𝜆superscriptsubscript𝜃𝑤𝜆1subscript𝜂𝜆subscriptd𝝋subscript𝜃𝑤superscriptsubscript𝜃𝑠𝑡superscriptsubscript𝜃𝑤𝜆1\theta_{w}^{\lambda}=\theta_{w}^{\lambda-1}-\eta_{\lambda}\scalebox{1.3}{d}_{% \bm{\varphi}(\theta_{w})}(\theta_{s}^{t};\theta_{w}^{\lambda-1}),italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_λ end_POSTSUPERSCRIPT = italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_λ - 1 end_POSTSUPERSCRIPT - italic_η start_POSTSUBSCRIPT italic_λ end_POSTSUBSCRIPT d start_POSTSUBSCRIPT bold_italic_φ ( italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ; italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_λ - 1 end_POSTSUPERSCRIPT ) , (21)

where ηλsubscript𝜂𝜆\eta_{\lambda}italic_η start_POSTSUBSCRIPT italic_λ end_POSTSUBSCRIPT denotes the learning rate and d𝝋(θw)(θst;θwλ1)subscriptd𝝋subscript𝜃𝑤superscriptsubscript𝜃𝑠𝑡superscriptsubscript𝜃𝑤𝜆1\scalebox{1.3}{d}_{\bm{\varphi}(\theta_{w})}(\theta_{s}^{t};\theta_{w}^{% \lambda-1})d start_POSTSUBSCRIPT bold_italic_φ ( italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ; italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_λ - 1 end_POSTSUPERSCRIPT ) denotes the derivative of wsubscript𝑤\mathcal{L}_{w}caligraphic_L start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT with respect to θwλ1superscriptsubscript𝜃𝑤𝜆1\theta_{w}^{\lambda-1}italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_λ - 1 end_POSTSUPERSCRIPT.

In the upper level optimization, we use the feedback from the lower level optimization (i.e., the student model) to update the teacher weight matrix, and thus rationally adjust the weights of the teacher models. After updating the teacher weight matrix, we map the predictions of k𝑘kitalic_k teacher models to soft labels through Eq.12 and then train the student models again (i.e., the lower level optimization). By iteratively updating the upper level optimization and lower level optimization, we fully transfer knowledge from the teacher models to the student model.

Finally, through bi-level optimization strategy, we successfully addressed the issue that previous methods require a large number of correct labels to learn the weights of the teacher models (Validated in Section 4.5).

3.2. Complexity Analysis

3.2.1. Time Complexity

The bi-level optimization needs to first construct upper and lower level optimization tasks and and then iteratively optimize these two tasks. Specifically, in the forward propagation, our method needs the time cost of O(n2dsuperscript𝑛2𝑑n^{2}ditalic_n start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_d), where n𝑛nitalic_n and d𝑑ditalic_d, respectively, are the number of nodes and dimensions. In the back propagation, our bi-level optimization requires to calculate the gradient θstθwsuperscriptsubscript𝜃𝑠𝑡subscript𝜃𝑤\frac{\partial{\theta_{s}^{t}}}{\partial{\theta_{w}}}divide start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_ARG start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_ARG and its complexity is O(s+w𝑠𝑤s+witalic_s + italic_w) Franceschi et al. (2017), where w𝑤witalic_w is the parameter number of the teacher weight matrix, usually s>>wmuch-greater-than𝑠𝑤s>>witalic_s > > italic_w. As a result, the time complexity of our bi-level optimization method is O(n2d+s+wsuperscript𝑛2𝑑𝑠𝑤n^{2}d+s+witalic_n start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_d + italic_s + italic_w). In contrast, traditional optimization methods need O(n2d+ssuperscript𝑛2𝑑𝑠n^{2}d+sitalic_n start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_d + italic_s). Obviously, they are quadratic to the node number.

3.2.2. Space Complexity

The bi-level optimization strategy calculates the gradients of the loss for the parameters of the student model to have the space complexity of O((n+d+c)n+s𝑛𝑑𝑐𝑛𝑠(n+d+c)n+s( italic_n + italic_d + italic_c ) italic_n + italic_s), where the space complexity of the feature matrix, the adjacency matrix, the prediction matrix, the model parameter matrix, and the parameter gradient matrix, respectively, are with O(nd𝑛𝑑nditalic_n italic_d), O(n2superscript𝑛2n^{2}italic_n start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT), O(nc𝑛𝑐ncitalic_n italic_c), O(s𝑠sitalic_s), and O(s𝑠sitalic_s), where c𝑐citalic_c denotes the output dimension of the model. The bi-level optimization strategy need to store θstθwsuperscriptsubscript𝜃𝑠𝑡subscript𝜃𝑤\frac{\partial{\theta_{s}^{t}}}{\partial{\theta_{w}}}divide start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_ARG start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_ARG, 𝝍t(θst1;θw)θst1subscript𝝍𝑡superscriptsubscript𝜃𝑠𝑡1subscript𝜃𝑤superscriptsubscript𝜃𝑠𝑡1\frac{\partial{\bm{\psi}_{t}(\theta_{s}^{t-1};\theta_{w})}}{\partial{\theta_{s% }^{t-1}}}divide start_ARG ∂ bold_italic_ψ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT ; italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ) end_ARG start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT end_ARG and 𝝍t(θst1;θw)θwsubscript𝝍𝑡superscriptsubscript𝜃𝑠𝑡1subscript𝜃𝑤subscript𝜃𝑤\frac{\partial{\bm{\psi}_{t}(\theta_{s}^{t-1};\theta_{w})}}{\partial{\theta_{w% }}}divide start_ARG ∂ bold_italic_ψ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT ; italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ) end_ARG start_ARG ∂ italic_θ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_ARG in the lower level optimization task as well. Hence, the space complexity in our bi-level optimization strategy is O((n+d+c)n+s+w𝑛𝑑𝑐𝑛𝑠𝑤(n+d+c)n+s+w( italic_n + italic_d + italic_c ) italic_n + italic_s + italic_w), which is quadratic to the node number. In contrast, the space complexity of the traditional optimization strategy (e.g., the first-order gradient descent methods) is of O((n+d+c)n+s𝑛𝑑𝑐𝑛𝑠(n+d+c)n+s( italic_n + italic_d + italic_c ) italic_n + italic_s).

4. Experiments

Table 1. Classification accuracy (average accuracy (%) and standard deviation) of all methods with different noise rates on all datasets. Note that, the bold number represents the best results in the whole column.
Datasets Methods Uniform Pair
20% 40% 60% 20% 40%
Cora GCN 74.74 (0.6) 69.40 (1.6) 41.76 (1.9) 72.58 (2.1) 51.82 (0.4)
DGI 81.56 (0.2) 79.92 (0.4) 48.94 (0.3) 81.16 (0.3) 61.60 (0.5)
GREET 80.88 (0.8) 78.22 (1.3) 44.70 (2.8) 77.96 (1.7) 56.42 (1.8)
JoCoR 75.58 (0.8) 71.14 (1.1) 42.82 (1.7) 74.46 (1.1) 52.54 (1.2)
NRGNN 79.14 (0.7) 78.68 (1.0) 52.02 (2.1) 78.66 (1.8) 56.30 (5.9)
MTS-GNN 81.50 (0.8) 79.26 (0.7) 63.34 (3.1) 80.68 (1.8) 62.16 (2.2)
Proposed 82.64 (1.1) 82.20 (0.9) 73.16 (1.0) 81.64 (1.6) 73.46 (3.2)
Citeseer GCN 63.18 (0.6) 56.04 (2.0) 37.74 (0.7) 66.60 (0.9) 40.68 (2.7)
DGI 69.24 (0.2) 63.20 (0.1) 42.14 (0.3) 69.24 (0.3) 45.14 (0.2)
GREET 68.36 (0.9) 62.46 (1.2) 44.50 (1.7) 70.74 (1.5) 46.88 (1.0)
JoCoR 69.06 (1.2) 57.18 (1.4) 38.64 (1.0) 69.54 (0.6) 46.82 (2.4)
NRGNN 66.22 (1.2) 59.66 (1.5) 45.82 (3.2) 67.86 (1.8) 50.10 (2.1)
MTS-GNN 71.94 (1.6) 69.74 (1.1) 53.64 (2.1) 72.34 (1.1) 59.84 (2.8)
Proposed 72.14 (0.7) 70.80 (0.8) 58.18 (3.9) 72.90 (0.5) 63.36 (3.9)
DBLP GCN 80.32 (0.2) 77.54 (1.1) 65.05 (0.6) 80.19 (0.5) 61.56 (1.2)
DGI 79.86 (0.2) 79.28 (0.2) 67.59 (0.2) 78.99 (0.1) 70.29 (0.5)
GREET 80.62 (0.3) 79.83 (0.5) 69.10 (1.0) 81.16 (0.2) 71.41 (1.8)
JoCoR 80.77 (0.3) 78.52 (0.9) 67.32 (1.1) 80.26 (0.3) 64.71 (3.3)
NRGNN 81.89 (0.4) 82.20 (0.3) 76.13 (1.6) 82.01 (0.4) 72.80 (1.2)
MTS-GNN 82.76 (0.2) 82.49 (0.3) 78.82 (1.5) 82.68 (2.4) 70.35 (4.5)
Proposed 83.93 (0.2) 83.87 (0.2) 81.18 (0.3) 83.56 (1.7) 77.32 (0.8)
Photo GCN 88.15 (0.6) 84.51 (0.7) 73.27 (5.9) 88.50 (0.8) 66.20 (1.3)
DGI 87.83 (0.4) 86.78 (0.5) 78.61 (0.9) 85.67 (0.1) 70.71 (0.2)
GREET 88.77 (0.4) 86.39 (0.5) 74.71 (1.2) 88.06 (0.4) 65.66 (0.7)
JoCoR 88.34 (1.1) 84.17 (0.8) 79.22 (2.1) 88.78 (0.5) 70.02 (2.9)
NRGNN 88.30 (0.5) 86.17 (0.6) 77.25 (3.5) 88.61 (0.6) 66.20 (2.4)
MTS-GNN 91.01 (1.4) 88.10 (1.2) 85.12 (1.7) 91.74 (0.3) 81.26 (4.6)
Proposed 91.75 (0.1) 90.06 (1.0) 87.01 (0.9) 91.26 (0.3) 86.96 (2.3)
Computers GCN 86.66 (0.3) 83.18 (0.5) 78.07 (0.4) 84.60 (0.1) 72.91 (0.8)
DGI 83.12 (0.1) 77.71 (0.1) 75.89 (0.3) 81.73 (0.2) 77.71 (0.1)
GREET 84.16 (0.2) 80.75 (0.8) 72.19 (1.2) 83.05 (0.1) 64.35 (0.9)
JoCoR 86.96 (0.4) 83.46 (0.4) 77.99 (0.5) 84.92 (0.4) 73.93 (0.9)
NRGNN 86.51 (0.5) 83.63 (0.5) 77.37 (0.8) 84.34 (0.5) 74.22 (1.1)
MTS-GNN 86.43 (0.2) 84.29 (0.2) 80.65 (0.1) 86.36 (0.4) 76.55 (2.1)
Proposed 85.81 (0.2) 84.89 (0.1) 80.85 (0.3) 86.04 (0.2) 82.38 (1.2)

4.1. Experimental Settings

4.1.1. Datasets

The benchmark datasets in our experiments include three citation datasets (i.e., DBLP Bojchevski and Günnemann (2017), Cora and Citeseer Yang et al. (2016)), and two business datasets (i.e., Computers and Photo Shchur et al. (2018)).

4.1.2. Comparison Methods

The comparison methods include two baseline methods (i.e., GCN Kipf and Welling (2016) and GAT Veličković et al. (2017)), four unsupervised graph representation learning method (i.e., DGI Veličković et al. (2018), GCA Zhu et al. (2021), SUGRL Mo et al. (2022), and GREET Liu et al. (2023b)), and three noisy label methods (i.e., JoCoR Wei et al. (2020), NRGNN Dai et al. (2021), and MTS-GNN Liu et al. (2023a)).

Table 2. Classification accuracy (average accuracy (%) and standard deviation) of the proposed BO-NNC with different components on all datasets at the highest noise rates for the two noise types. Note that the bold number represents the best results in the whole column.
Uniform (60%)
C1 C2 C3 Cora Citeseer DBLP Photo Computers
55.22 (1.0) 44.74 (2.4) 77.57 (0.8) 83.74 (1.3) 79.54 (0.4)
72.50 (1.3) 53.24 (3.0) 80.54 (0.4) 86.23 (1.1) 80.45 (0.6)
58.56 (5.0) 45.28 (2.1) 76.67 (0.6) 84.04 (1.0) 79.30 (2.0)
71.26 (1.4) 56.78 (3.5) 81.25 (0.3) 86.10 (0.7) 80.36 (0.7)
73.16 (1.0) 58.18 (3.9) 81.95 (0.3) 87.01 (0.9) 80.85 (0.3)


Pair (40%) C1 C2 C3 Cora Citeseer DBLP Photo Computers 67.14 (3.1) 51.56 (2.3) 76.05 (1.0) 78.70 (5.2) 79.42 (1.1) 70.52 (6.4) 63.56 (2.0) 74.33 (0.7) 81.08 (1.2) 81.09 (0.6) 65.76 (5.3) 55.62 (2.1) 75.22 (1.8) 79.39 (5.0) 76.43 (2.3) 73.36 (3.5) 62.08 (2.5) 76.01 (1.0) 84.95 (2.2) 81.35 (1.3) 77.50 (2.9) 63.36 (3.9) 77.32 (0.8) 86.96 (2.3) 82.38 (1.2)

Table 3. Classification accuracy (average accuracy (%) and standard deviation) of different multi-teacher distillation methods at 60% noise rates on all datasets. Note that, the bold number represents the best results in the whole column.
Uniform (60%)
Cora Citeseer DBLP Photo Computers
Mean 71.26 (1.4) 56.78 (3.5) 81.25 (0.3) 86.10 (0.7) 80.36 (0.7)
Co-attention 71.82 (0.9) 57.24 (3.2) 81.35 (0.3) 85.66 (1.7) 80.78 (0.6)
Proposed 73.16 (1.0) 58.18 (3.9) 81.95 (0.3) 87.01 (0.9) 80.85 (0.3)


Pair (40%) Cora Citeseer DBLP Photo Computers Mean 73.36 (3.5) 62.08 (2.5) 76.01 (1.0) 84.95 (2.2) 81.35 (1.3) Co-attention 72.96 (0.4) 62.10 (2.7) 77.17 (0.9) 85.08 (1.9) 80.80 (1.6) Proposed 77.50 (2.9) 63.36 (3.9) 77.32 (0.8) 86.96 (2.3) 82.38 (1.2)

4.1.3. Setting-up

We follow the PyTorch Geometric library to split the datasets Cora and Citeseer, and randomly sample 5%, 10%, and 60% of the nodes, respectively, as labelled training set, validation set, and test set for the datasets DBLP, Photo, and Computers. There is no overlap among the training set, the validation set and the test set. We follow the literature Han et al. (2018); Jiang et al. (2018) to add different rates of noisy into the real datasets. Specifically, given a noise rate p𝑝pitalic_p, we first generate noisy labels over all classes according to a noise transition matrix 𝐐c×c𝐐superscript𝑐𝑐\mathbf{Q}\in\mathbb{R}^{c\times c}bold_Q ∈ blackboard_R start_POSTSUPERSCRIPT italic_c × italic_c end_POSTSUPERSCRIPT, where qij=𝐩(y¯=j|y=i)subscript𝑞𝑖𝑗𝐩¯𝑦conditional𝑗𝑦𝑖q_{ij}=\mathbf{p}(\overline{y}=j|y=i)italic_q start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = bold_p ( over¯ start_ARG italic_y end_ARG = italic_j | italic_y = italic_i ) is the probability of clean label y𝑦yitalic_y being flipped to the noisy label y¯¯𝑦\overline{y}over¯ start_ARG italic_y end_ARG, and then corrupt the labels of label training set with two types of label noise:

  • \bullet

    Uniform noise The label has p𝑝pitalic_p probability of being randomly flipped to other class. Furthermore, we set noise rates of 20%, 40%, and 60% for the uniform noise.

  • \bullet

    Pair noise Labels are assumed to make mistakes only within the most similar pair classes. More specifically, label has p𝑝pitalic_p probability of being randomly flipped to their pair classes. Furthermore, we set noise rates of 20%, 40% for the pair noise.

To reduce randomness of experimental results, we conduct five experiments with different random seeds and report the average results and corresponding standard deviations. In our experiments, we train three encoders by employing GCA Zhu et al. (2021), DGI Veličković et al. (2018) and SUGRL Mo et al. (2022), plus single layer MLP for every encoder to obtain three teacher classifiers. For a fair comparison, the methods, including JoCoR, NRGNN, MTS-GNN, and the student model of our method use a two-layer GCN as the backbone. In addition, we obtain the source code of all comparison methods from the authors and set the parameters of all comparison methods according to the original literature so that they output the best performance on all datasets.

4.2. Result Analysis

We compare our proposed BO-NNC with all comparison methods on five datsets in terms of node classification tasks with different noise rates and report the results in Table 1.

First, the proposed BO-NNC consistently achieves the best results on all datasets, followed by MTS-GNN, NRGNN, DGI, GREET, JoCoR, GAT, GCA, SUGRL and GCN. The reason is that our BO-NNC enables the student model to obtain the knowledge from multiple teacher models, thus solving the limitations of the single model to effectively deal with noisy labels. Second, our BO-NNC outperforms two baseline methods (i.e., GCN and GAT) and four unsupervised graph representation learning method (i.e., GCA, DGI, SUGRL, and GREET) by a large margin, because the proposed method provides more correct supervised information for the model training by noisy label filtering and pseudo-label selection. For example, the proposed BO-NNC averagely improves by 2.93%, 4.99%, and 13.44% respectively, compared with the DGI that performs best in there six methods, in uniform noisy rates of 20%, 40%, and 60% on all datasets. Third, our MTS-GNN achieves promising improvements, compared with the methods that deal with the noisy label problem, i.e., JoCoR, NRGNN, and MTS-GNN. This indicates that our method fully leverages the complementary information among teacher models through bi-level optimization, and thus enabling the student model to achieve better generalization performance.

Refer to caption
(a) Cora
Refer to caption
(b) Citeseer
Refer to caption
(c) DBLP
Refer to caption
(d) Photo
Refer to caption
(e) Computers
Figure 2. Parameter sensitivity analysis of r𝑟ritalic_r and ρ𝜌\rhoitalic_ρ in our method at 60% noise rate.
Refer to caption
(a) Cora
Refer to caption
(b) Citeseer
Refer to caption
(c) DBLP
Refer to caption
(d) Photo
Refer to caption
(e) Computers
Figure 3. Parameter sensitivity analysis of β1subscript𝛽1\beta_{1}italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and β2subscript𝛽2\beta_{2}italic_β start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT in our method at 60% noise rate.

4.3. Ablation Study

Our method contains three important components, i.e., using the student model to integrate knowledge from multiple teacher models (C1 for short), multi-teacher knowledge distillation based on bi-level optimization (C2 for short), and label improvement (C3 for short). To demonstrate the effectiveness of each component, we report the node classification performance of different component combinations on all datasets at the highest noisy rates in Table 2.

First, our method with all components averagely improves by 5.01%, compared with the methods with one component only. This indicates that all three components are essential for our method, which verifies the feasibility of our proposed method. Second, the overall effectiveness of the method considering the student model is significantly superior to the one without considering the student model, because the student model can explore the complementary information from the teacher models. Last, bi-level optimization needs enough clean node to achieve effectiveness, while label improvement may provide enough clean data. Hence, it is reasonable to simultaneously consider bi-level optimization based multi-teacher distillation and label improvement for dealing with noisy labels.

Refer to caption
Figure 4. Parameter sensitivity analysis of α𝛼\alphaitalic_α in our method at 60% noise rate.

4.4. Effectiveness of Multi-Teacher Construction

We conduct experiments to investigate the effectiveness of our constructed multi-teacher models on all datasets at the highest noisy rates (i.e., 60%) for two types of noise and report the results in Appendix C.

Obviously, the performance of the student model increases with the increase of the number of teacher models. Specifically, compared to the best performance of the methods with only one teacher model, the results of our method with three teacher models improves by an average of 2.68% and 2.49%, respectively, for uniform noise and pair noise on all datasets. This suggests that multi-teacher models provide diverse complementary information to the student models.

4.5. Effectiveness of Bi-level Optimization

We compare our method with the mean method and the co-attention method on all datasets at the highest noisy rates (i.e., 60%) for two noise types and report the results in Table 3, to verify the effectiveness of the proposed bi-level optimization strategy.

Obviously, our proposed method achieves the best results. Specifically, compared with the best comparison method, i.e., the co-attention method, our method improves by on average of 0.86% and 1.88%, respectively, for uniform noise and pair noise on all datasets. This fully demonstrates that our proposed method can more effectively capture the complementary information among teacher models, compared to all comparison methods.

4.6. Parameter Sensitivity Analysis

Our method has five key hyper-parameters, i.e., ρ𝜌\rhoitalic_ρ for pseudo-label selection, r𝑟ritalic_r for noisy label selection, and β1subscript𝛽1\beta_{1}italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, β2subscript𝛽2\beta_{2}italic_β start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, and α𝛼\alphaitalic_α for selecting the clean node set. We investigate the sensitivity of our method to these hyper-parameters at 60% noisy rate, and report the results in Figures 24.

Based on the experimental results, first, our method is not sensitive to the hyper-parameters, including β1subscript𝛽1\beta_{1}italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, β2subscript𝛽2\beta_{2}italic_β start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, and α𝛼\alphaitalic_α, because our bi-level optimization process does not rely on a large number of correct labels. Second, our method is sensitive to the parameters r𝑟ritalic_r and ρ𝜌\rhoitalic_ρ. Specifically, if the value of r𝑟ritalic_r is less than 0.4, the performance of the model gradually increases with the increase of r𝑟ritalic_r. If the value of r𝑟ritalic_r is larger than 0.6, the performance of the model gradually decreases with the increase of r𝑟ritalic_r. The reason is that too many correct labels were removed. , If the value of ρ𝜌\rhoitalic_ρ is less than 70, the variation of ρ𝜌\rhoitalic_ρ has little impact on the model performance. If the value of ρ𝜌\rhoitalic_ρ is larger than 70, the performance of the model gradually decreases with the increase of ρ𝜌\rhoitalic_ρ. The reason is that the incorrect pseudo-label will be inevitably selected.

5. Conclusion

In this paper, we proposed a new NNC method to address the limitations in previous methods. Specifically, multi-teacher distillations transfer diverse information to the learning of the student model. The bi-level optimization strategy fully explores the complementary information among multiple teacher models for the learning of the student model so that requiring less number of correct labels to achieve a robust student model. Experimental results showed that our method achieves supreme performance, compared to the state-of-the-art methods, in terms of noisy node classification tasks.

References

  • (1)
  • Bojchevski and Günnemann (2017) Aleksandar Bojchevski and Stephan Günnemann. 2017. Deep gaussian embedding of graphs: Unsupervised inductive learning via ranking. arXiv preprint arXiv:1707.03815 (2017).
  • Chen et al. (2022) Can Chen, Xi Chen, Chen Ma, Zixuan Liu, and Xue Liu. 2022. Gradient-based bi-level optimization for deep learning: A survey. arXiv preprint arXiv:2207.11719 (2022).
  • Dai et al. (2021) Enyan Dai, Charu Aggarwal, and Suhang Wang. 2021. Nrgnn: Learning a label noise resistant graph neural network on sparsely and noisily labeled graphs. In KDD. 227–236.
  • Dai and Wang (2021) Enyan Dai and Suhang Wang. 2021. Say no to the discrimination: Learning fair graph neural networks with limited sensitive attribute information. In WSDM. 680–688.
  • Franceschi et al. (2017) Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. 2017. Forward and reverse gradient-based hyperparameter optimization. In ICML. PMLR, 1165–1173.
  • Gao et al. (2021) Yan Gao, Titouan Parcollet, and Nicholas D Lane. 2021. Distilling knowledge from ensembles of acoustic models for joint ctc-attention end-to-end speech recognition. In ASRU. IEEE, 138–145.
  • Gui et al. (2021) Xian-Jin Gui, Wei Wang, and Zhang-Hao Tian. 2021. Towards understanding deep learning from noisy labels with small-loss criterion. arXiv preprint arXiv:2106.09291 (2021).
  • Guo et al. (2019) Shengnan Guo, Youfang Lin, Ning Feng, Chao Song, and Huaiyu Wan. 2019. Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In AAAI, Vol. 33. 922–929.
  • Han et al. (2018) Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. 2018. Co-teaching: Robust training of deep neural networks with extremely noisy labels. NeurIPS 31 (2018).
  • Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
  • Jiang et al. (2018) Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. 2018. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML. PMLR, 2304–2313.
  • Kim et al. (2021) Kyungyul Kim, ByeongMoon Ji, Doyoung Yoon, and Sangheum Hwang. 2021. Self-knowledge distillation with progressive refinement of targets. In ICCV. 6567–6576.
  • Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
  • Li et al. (2018) Qimai Li, Zhichao Han, and Xiao-Ming Wu. 2018. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, Vol. 32.
  • Li et al. (2021) Yayong Li, Jie Yin, and Ling Chen. 2021. Unified robust training for graph neural networks against label noise. In PAKDD. Springer, 528–540.
  • Liu et al. (2021a) Risheng Liu, Jiaxin Gao, Jin Zhang, Deyu Meng, and Zhouchen Lin. 2021a. Investigating bi-level optimization for learning and vision from a unified perspective: A survey and beyond. TPAMI 44, 12 (2021), 10045–10067.
  • Liu et al. (2021b) Risheng Liu, Xuan Liu, Xiaoming Yuan, Shangzhi Zeng, and Jin Zhang. 2021b. A value-function-based interior-point method for non-convex bi-level optimization. In ICML. PMLR, 6882–6892.
  • Liu et al. (2023a) Yujing Liu, Zongqian Wu, Zhengyu Lu, Guoqiu Wen, Junbo Ma, Guangquan Lu, and Xiaofeng Zhu. 2023a. Multi-teacher Self-training for Semi-supervised Node Classification with Noisy Labels. In ACM MM. 2946–2954.
  • Liu et al. (2023b) Yixin Liu, Yizhen Zheng, Daokun Zhang, Vincent CS Lee, and Shirui Pan. 2023b. Beyond smoothing: Unsupervised graph representation learning with edge heterophily discriminating. In AAAI, Vol. 37. 4516–4524.
  • Mo et al. (2022) Yujie Mo, Liang Peng, Jie Xu, Xiaoshuang Shi, and Xiaofeng Zhu. 2022. Simple unsupervised graph representation learning. In AAAI, Vol. 36. 7797–7805.
  • Shchur et al. (2018) Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. 2018. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868 (2018).
  • Veličković et al. (2017) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
  • Veličković et al. (2018) Petar Veličković, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. 2018. Deep graph infomax. arXiv preprint arXiv:1809.10341 (2018).
  • Wei et al. (2020) Hongxin Wei, Lei Feng, Xiangyu Chen, and Bo An. 2020. Combating noisy labels by agreement: A joint training method with co-regularization. In CVPR. 13726–13735.
  • Wu et al. (2022) Lirong Wu, Yufei Huang, Haitao Lin, Zicheng Liu, Tianyu Fan, and Stan Z Li. 2022. Automated Graph Self-supervised Learning via Multi-teacher Knowledge Distillation. arXiv preprint arXiv:2210.02099 (2022).
  • Yang et al. (2016) Zhilin Yang, William Cohen, and Ruslan Salakhudinov. 2016. Revisiting semi-supervised learning with graph embeddings. In ICML. PMLR, 40–48.
  • Zhu et al. (2021) Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. 2021. Graph contrastive learning with adaptive augmentation. In Proceedings of the Web Conference 2021. 2069–2080.