I plan to use a larger network to distill a smaller one, and I would like to know the theoretical upper limit on the strength of a weaker network when it is trained on data generated by a stronger network. For example, if I train b20c256 on the training data of b28c512nbt, will b20c256 become stronger? If so, by how much? Can it become stronger than b40c256? At the very least, can it become stronger than it is now?