AI #transformers #attention-heads #pruning #OAN #taylor-criterion #FPL #topology #t-model-series
T_MODEL: What We Learned About Attention Head Importance in Fine-Tuned Transformers
A full account of the T_MODEL experiment series. We set out to test whether topology could predict attention head importance. Along the way we found that activation magnitude (OAN) is a strong gradient-free predictor for binary sentiment classifiers, confirmed that the standard Taylor gradient criterion fails on all three models tested, and discovered that OAN's boundary condition (where it works and where it does not) may be more informative than the positive result itself.