Between infancy and adulthood, the number of synapses in our brain first multiplies and then falls.
Synaptic pruning improves efficiency by removing redundant neurons and strengthening synaptic connections
that are most useful for the environment.
Despite losing 50% of all synapses between the ages of two and ten, the brain continues to function.
In 1990, a popular paper titled Optimal Brain Damage applied a similar idea to artificial neural networks, showing that the weights which contribute least to performance can be removed from a trained network with minimal loss in accuracy.
At face value, pruning does appear to promise you can (almost) have it all.
State-of-the-art pruning methods remove the majority of weights with minimal degradation to top-1 accuracy.
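To make this concrete, here is a minimal sketch of one-shot global magnitude pruning, one widely used family of methods that simply zeroes out the smallest-magnitude weights. The exact criteria and schedules vary across state-of-the-art approaches, and the function and variable names below are illustrative rather than taken from our released code:

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude.

    `weights` maps layer names to numpy arrays. This one-shot, global variant
    is a simplification: it picks a single magnitude threshold across all
    layers and masks everything below it. Production schedules are typically
    gradual and interleaved with further training.
    """
    all_magnitudes = np.concatenate([np.abs(w).ravel() for w in weights.values()])
    threshold = np.quantile(all_magnitudes, sparsity)
    return {name: np.where(np.abs(w) < threshold, 0.0, w)
            for name, w in weights.items()}

# Example: prune 90% of the weights of a toy two-layer network.
toy_weights = {"conv1": np.random.randn(64, 3, 3, 3), "fc": np.random.randn(512, 10)}
pruned_weights = magnitude_prune(toy_weights, sparsity=0.9)
```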
However, the ability to prune networks with seemingly so little degradation to generalization performance is puzzling. The cost to top-1 accuracy appears minimal if it is spread uniformly across all classes, but what if the cost is concentrated in only a few classes? Are certain types of examples or classes disproportionately impacted by pruning?
An understanding of these trade-offs is critical when deep neural networks are used for sensitive tasks such as hiring or medical diagnosis.
In this work, we propose a formal framework to identify the classes and images where there is a high level of disagreement or difference in generalization performance between pruned and non-pruned models. We find that certain examples, which we term pruning identified exemplars (PIEs), and classes are systematically more impacted by the introduction of sparsity.
The primary findings of our work can be summarized as follows:
PIEs are images where the most frequent prediction differs between a population of independently trained pruned and non-pruned models. We focus on open source research datasets such as ImageNet and find that PIE images are more challenging for both pruned and non-pruned models. Restricting the test-set to a random sample of PIE images sharply degrades top-1 accuracy. Removing PIEs from the test-set improves top-1 accuracy for both pruned and non-pruned models.
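As a sketch of how this comparison can be implemented, assume we have stored each model's predicted class for every test image as an integer array per population; the array and function names below are illustrative, not from our released code:

```python
import numpy as np

def modal_prediction(preds):
    """Most frequent predicted class per image.

    preds: integer array of shape (num_models, num_images), one row per
    independently trained model.
    """
    return np.array([np.bincount(column).argmax() for column in preds.T])

def find_pies(preds_non_pruned, preds_pruned):
    """Return indices of pruning identified exemplars (PIEs): images where the
    modal prediction of the pruned population differs from that of the
    non-pruned population."""
    return np.flatnonzero(
        modal_prediction(preds_non_pruned) != modal_prediction(preds_pruned))
```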
To better understand why PIEs are more sensitive to the removal of capacity, we conducted a limited human study (85 participants) and found that PIEs from the ImageNet test set are more likely to be mislabelled, depict multiple objects, or require fine-grained classification.
Over half of all PIE images were classified by human participants as either having an incorrect ground-truth label or depicting multiple objects. This over-indexing of poorly structured data hints that the explosion in the number of parameters for single-image classification tasks like ImageNet may be solving a problem that is better addressed in the data-cleaning pipeline.
On real-world datasets, the stakes are often much higher than correctly classifying a paddle or guacamole. For sensitive tasks such as patient risk stratification or medical diagnoses, PIE provides one tool to become more familiar with the underlying data by surfacing to the human expert a far smaller subset of examples that the model finds challenging. This can be extremely valuable for building human-in-the-loop decision systems, where certain atypical examples are re-routed for human inspection.
ImageNet has 1000 different class categories, which include everyday objects such as a cassette player as well as more nuanced categories that refer to the texture of an object, such as velvet, or even types of person, such as groom. If the impact of pruning were uniform across all classes, we would expect the accuracy on each class to shift by the same number of percentage points as the difference in top-1 accuracy between the pruned and non-pruned model.
This forms our null hypothesis, and for each class we must decide whether to reject the null hypothesis and accept the alternative: the change to class-level recall differs from the change to overall accuracy in a statistically significant way. This amounts to asking: did the class perform better or worse than expected, given the overall change in top-1 accuracy after pruning?
Evaluating whether the difference between samples of mean-shifted class accuracies from pruned and non-pruned models is “real” can be framed as determining whether two data samples are drawn from the same underlying distribution, a question addressed by a large body of goodness-of-fit literature.
To compare class-level performance between pruned and non-pruned models, we use a two-sample, two-tailed, independent Welch’s t-test.
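Concretely, for each class we compare the per-class accuracies of the pruned and non-pruned populations after subtracting each model's overall top-1 accuracy, so that the test isolates the class-specific deviation from the overall shift. The sketch below, using SciPy, is illustrative; the exact procedure is described in the paper and open source code:

```python
import numpy as np
from scipy import stats

def class_level_significance(acc_non_pruned, acc_pruned,
                             top1_non_pruned, top1_pruned, alpha=0.05):
    """Two-sample, two-tailed, independent Welch's t-test per class.

    acc_*:  arrays of shape (num_models, num_classes) with per-class recall.
    top1_*: arrays of shape (num_models,) with each model's overall top-1 accuracy.
    Returns a boolean array flagging classes whose change in recall deviates
    from the overall change in top-1 accuracy at significance level `alpha`.
    """
    # Subtract each model's overall accuracy so only class-specific deviations remain.
    shifted_non_pruned = acc_non_pruned - top1_non_pruned[:, None]
    shifted_pruned = acc_pruned - top1_pruned[:, None]

    # equal_var=False selects Welch's t-test (independent samples, unequal variances).
    _, p_values = stats.ttest_ind(shifted_non_pruned, shifted_pruned,
                                  axis=0, equal_var=False)
    return p_values < alpha
```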
The directionality and magnitude of the impact of pruning are nuanced and surprising. Our results show that certain classes are relatively robust to the overall degradation experienced by the model, whereas others degrade in performance far more than the model itself. This amounts to selective brain damage, with performance on certain classes evidencing far more sensitivity to the removal of model capacity.
At every level of pruning, the classes that experience a significant relative decrease in accuracy are fewer than those that receive a relative boost; however, the magnitude of the class-level decreases is larger than that of the gains (which pulls overall accuracy downwards). This tells us that the loss in generalization caused by pruning is far more concentrated than the relative gains, with fewer classes bearing the brunt of the degradation caused by weight removal.
At higher levels of pruning, more classes are impacted and the absolute difference in accuracy between the most and least impacted classes widens. Most real-world applications of pruning remove more than 50% of weights in order to realize the gains in memory and efficiency. When 90% of the weights are removed, the relative change to 582 out of 1000 ImageNet classes is statistically significant.
Pruned models are widely used in real-world machine learning applications; many of the models running on your phone have likely been pruned or compressed in some way. Our results are surprising and suggest that a reliance on top-line metrics such as top-1 or top-5 test-set accuracy hides critical details in the ways that pruning impacts model generalization.
However, our methodology offers one way for humans to better understand the trade-offs incurred by pruning and to gain intuition about which classes benefit most from additional capacity. We believe this type of tooling is a valuable first step toward helping human experts audit these trade-offs and surface challenging examples for human judgement.
We welcome additional discussion and code contributions on the topic of this work. A comprehensive introduction to the methodology, experimental framework and results can be found in our paper and open source code. There is substantial ground we were not able to address within the scope of this work; areas worthy of future consideration include evaluating the impact of pruning on additional domains such as language and audio, considering different architectures, and comparing the relative trade-offs incurred by pruning with other popular compression techniques such as quantization.
Visiting the NeurIPS Google Booth?
Take a look at our demo slides here.
A special thank you is due to James Wexler, Keren Gu-Lemberg and Prajit Ramachandran for helpful suggestions about how to visualize and communicate our results in an interactive format. This article was in part prepared using the Google AI Pair template and style guide. The citation management for this article uses v1 of the Distill style script template.
We thank the generosity of our peers and colleagues for valuable feedback on earlier versions of this work. In particular, we would like to acknowledge the valuable input of Jonas Kemp, Simon Kornblith, Julius Adebayo, Dumitru Erhan, Hugo Larochelle, Nicolas Papernot, Catherine Olsson, Cliff Young, Martin Wattenberg, Utku Evci, James Wexler, Trevor Gale, Melissa Fabros, Prajit Ramachandran, Pieter Kindermans, Moustapha Cisse, Erich Elsen and Nyalleng Moorosi.
We thank the institutional support and encouragement of Dan Nanas, Rita Ruiz, Sally Jesmonth and Alexander Popper.
@article{hooker2019selective,
  title         = {Selective Brain Damage: Measuring the Disparate Impact of Model Pruning},
  author        = {Sara Hooker and Aaron Courville and Yann Dauphin and Andrea Frome},
  year          = {2019},
  url           = {https://arxiv.org/abs/1911.05248},
  eprint        = {1911.05248},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG}
}