Kenneth Borup

Even your Teacher Needs Guidance: Ground-Truth Targets Dampen Regularization Imposed by Self-Distillation

Research output: Contribution to book/anthology/report/proceeding › Article in proceedings › Research › peer-review

Standard

Even your Teacher Needs Guidance : Ground-Truth Targets Dampen Regularization Imposed by Self-Distillation. / Borup, Kenneth; Andersen, Lars N.

Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021. ed. / Marc'Aurelio Ranzato; Alina Beygelzimer; Yann Dauphin; Percy S. Liang; Jenn Wortman Vaughan. Neural Information Processing Systems Foundation, 2021. p. 5316-5327 (Advances in Neural Information Processing Systems, Vol. 7).

Harvard

Borup, K & Andersen, LN 2021, Even your Teacher Needs Guidance: Ground-Truth Targets Dampen Regularization Imposed by Self-Distillation. in MA Ranzato, A Beygelzimer, Y Dauphin, PS Liang & J Wortman Vaughan (eds), Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021. Neural Information Processing Systems Foundation, Advances in Neural Information Processing Systems, vol. 7, pp. 5316-5327, 35th Conference on Neural Information Processing Systems, NeurIPS 2021, Virtual, Online, 06/12/2021.

APA

Borup, K., & Andersen, L. N. (2021). Even your Teacher Needs Guidance: Ground-Truth Targets Dampen Regularization Imposed by Self-Distillation. In MA. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, & J. Wortman Vaughan (Eds.), Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021 (pp. 5316-5327). Neural Information Processing Systems Foundation. Advances in Neural Information Processing Systems Vol. 7

CBE

Borup K, Andersen LN. 2021. Even your Teacher Needs Guidance: Ground-Truth Targets Dampen Regularization Imposed by Self-Distillation. Ranzato MA, Beygelzimer A, Dauphin Y, Liang PS, Wortman Vaughan J, editors. In Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021. Neural Information Processing Systems Foundation. pp. 5316-5327. (Advances in Neural Information Processing Systems, Vol. 7).

MLA

Borup, Kenneth, and Lars N. Andersen. "Even your Teacher Needs Guidance: Ground-Truth Targets Dampen Regularization Imposed by Self-Distillation." Ranzato, Marc'Aurelio, Alina Beygelzimer, Yann Dauphin, Percy S. Liang, and Jenn Wortman Vaughan (editors). Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021. Neural Information Processing Systems Foundation. (Advances in Neural Information Processing Systems, Vol. 7). 2021, pp. 5316-5327.

Vancouver

Borup K, Andersen LN. Even your Teacher Needs Guidance: Ground-Truth Targets Dampen Regularization Imposed by Self-Distillation. In Ranzato MA, Beygelzimer A, Dauphin Y, Liang PS, Wortman Vaughan J, editors, Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021. Neural Information Processing Systems Foundation. 2021. p. 5316-5327. (Advances in Neural Information Processing Systems, Vol. 7).

Author

Borup, Kenneth ; Andersen, Lars N. / Even your Teacher Needs Guidance : Ground-Truth Targets Dampen Regularization Imposed by Self-Distillation. Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021. editor / Marc'Aurelio Ranzato ; Alina Beygelzimer ; Yann Dauphin ; Percy S. Liang ; Jenn Wortman Vaughan. Neural Information Processing Systems Foundation, 2021. pp. 5316-5327 (Advances in Neural Information Processing Systems, Vol. 7).

Bibtex

@inproceedings{154ca2a867ea40c08c3cbf50f3b4fe81,
title = "Even your Teacher Needs Guidance: Ground-Truth Targets Dampen Regularization Imposed by Self-Distillation",
abstract = "Knowledge distillation is classically a procedure where a neural network is trained on the output of another network along with the original targets in order to transfer knowledge between the architectures. The special case of self-distillation, where the network architectures are identical, has been observed to improve generalization accuracy. In this paper, we consider an iterative variant of self-distillation in a kernel regression setting, in which successive steps incorporate both model outputs and the ground-truth targets. This allows us to provide the first theoretical results on the importance of using the weighted ground-truth targets in self-distillation. Our focus is on fitting nonlinear functions to training data with a weighted mean square error objective function suitable for distillation, subject to ℓ2 regularization of the model parameters. We show that any such function obtained with self-distillation can be calculated directly as a function of the initial fit, and that infinite distillation steps yield the same optimization problem as the original with amplified regularization. Furthermore, we provide a closed-form solution for the optimal choice of weighting parameter at each step, and show how to efficiently estimate this weighting parameter for deep learning and significantly reduce the computational requirements compared to a grid search.",
author = "Kenneth Borup and Andersen, {Lars N.}",
note = "Publisher Copyright: {\textcopyright} 2021 Neural information processing systems foundation. All rights reserved.; 35th Conference on Neural Information Processing Systems, NeurIPS 2021 ; Conference date: 06-12-2021 Through 14-12-2021",
year = "2021",
language = "English",
series = "Advances in Neural Information Processing Systems",
pages = "5316--5327",
editor = "Marc'Aurelio Ranzato and Alina Beygelzimer and Yann Dauphin and Liang, {Percy S.} and {Wortman Vaughan}, Jenn",
booktitle = "Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021",
publisher = "Neural Information Processing Systems Foundation",

}
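
The abstract above describes iterative self-distillation in a kernel regression setting, where each step refits the model on a weighted combination of the ground-truth targets and the previous model's predictions under ℓ2 regularization. The following is a minimal, hypothetical Python sketch of such a loop, shown only to illustrate the idea: the RBF kernel choice, the function names rbf_kernel and self_distill, and the parameters alpha, lam, gamma, and steps are illustrative assumptions and are not taken from the paper.

# Hypothetical sketch of iterative self-distillation with weighted
# ground-truth targets, in the kernel ridge regression setting the
# abstract describes. All names and defaults are illustrative.
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and Z."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-gamma * sq)

def self_distill(X, y, steps=5, alpha=0.5, lam=1e-2, gamma=1.0):
    """Run `steps` rounds of self-distillation.

    Each round solves kernel ridge regression (closed form, ℓ2 penalty lam)
    on the current targets, then sets the next targets to a weighted mix
    alpha * y + (1 - alpha) * f_t(X) of ground truth and the model's outputs.
    """
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    targets = y.copy()
    coef, preds = None, None
    for _ in range(steps):
        # Closed-form kernel ridge solution for the current targets.
        coef = np.linalg.solve(K + lam * np.eye(n), targets)
        preds = K @ coef
        # Next round trains on a weighted mix of ground truth and own outputs.
        targets = alpha * y + (1 - alpha) * preds
    return coef, preds

# Toy usage on a 1-D regression problem.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
coef, fit = self_distill(X, y, steps=10, alpha=0.7)
print("training MSE after distillation:", np.mean((fit - y) ** 2))

In this sketch, setting alpha = 1 recovers ordinary ridge regression at every step, while smaller alpha leans more heavily on the previous fit, which is the weighting whose effect the paper analyzes.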

RIS

TY - GEN

T1 - Even your Teacher Needs Guidance: Ground-Truth Targets Dampen Regularization Imposed by Self-Distillation

T2 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021

AU - Borup, Kenneth

AU - Andersen, Lars N.

N1 - Publisher Copyright: © 2021 Neural information processing systems foundation. All rights reserved.

PY - 2021

Y1 - 2021

N2 - Knowledge distillation is classically a procedure where a neural network is trained on the output of another network along with the original targets in order to transfer knowledge between the architectures. The special case of self-distillation, where the network architectures are identical, has been observed to improve generalization accuracy. In this paper, we consider an iterative variant of self-distillation in a kernel regression setting, in which successive steps incorporate both model outputs and the ground-truth targets. This allows us to provide the first theoretical results on the importance of using the weighted ground-truth targets in self-distillation. Our focus is on fitting nonlinear functions to training data with a weighted mean square error objective function suitable for distillation, subject to ℓ2 regularization of the model parameters. We show that any such function obtained with self-distillation can be calculated directly as a function of the initial fit, and that infinite distillation steps yield the same optimization problem as the original with amplified regularization. Furthermore, we provide a closed-form solution for the optimal choice of weighting parameter at each step, and show how to efficiently estimate this weighting parameter for deep learning and significantly reduce the computational requirements compared to a grid search.

AB - Knowledge distillation is classically a procedure where a neural network is trained on the output of another network along with the original targets in order to transfer knowledge between the architectures. The special case of self-distillation, where the network architectures are identical, has been observed to improve generalization accuracy. In this paper, we consider an iterative variant of self-distillation in a kernel regression setting, in which successive steps incorporate both model outputs and the ground-truth targets. This allows us to provide the first theoretical results on the importance of using the weighted ground-truth targets in self-distillation. Our focus is on fitting nonlinear functions to training data with a weighted mean square error objective function suitable for distillation, subject to ℓ2 regularization of the model parameters. We show that any such function obtained with self-distillation can be calculated directly as a function of the initial fit, and that infinite distillation steps yield the same optimization problem as the original with amplified regularization. Furthermore, we provide a closed-form solution for the optimal choice of weighting parameter at each step, and show how to efficiently estimate this weighting parameter for deep learning and significantly reduce the computational requirements compared to a grid search.

UR - http://www.scopus.com/inward/record.url?scp=85131765610&partnerID=8YFLogxK

M3 - Article in proceedings

AN - SCOPUS:85131765610

T3 - Advances in Neural Information Processing Systems

SP - 5316

EP - 5327

BT - Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021

A2 - Ranzato, Marc'Aurelio

A2 - Beygelzimer, Alina

A2 - Dauphin, Yann

A2 - Liang, Percy S.

A2 - Wortman Vaughan, Jenn

PB - Neural Information Processing Systems Foundation

Y2 - 6 December 2021 through 14 December 2021

ER -