
T_MODEL: What We Learned About Attention Head Importance in Fine-Tuned Transformers

A full account of the T_MODEL experiment series. We set out to test whether topology predicts attention head importance. Instead we found that activation magnitude (OAN) is a strong gradient-free predictor for binary sentiment classifiers, that the standard Taylor gradient criterion fails on all three models tested, and that OAN's boundary condition -- where it works and where it doesn't -- may be more informative than the positive result itself.

Where this started

The T_MODEL series grew out of a simple question: if the Functional Proximity Law (FPL) holds in transformer architecture dependency graphs, does it also hold inside a running transformer -- in the attention pattern space?

The broader IRDME project had confirmed FPL across neural circuits, protein interaction networks, software codebases, digital logic, and AI architecture graphs. Transformer models are, among other things, a kind of dependency graph: each attention head routes information, and the topology of that routing might predict which heads are functionally important.

We ran five experiments across three models and two tasks. Here is what we found, including the part that didn't replicate.

Experiment 1: DistilBERT-SST2 (72 heads, binary sentiment)

The primary benchmark compared six head importance methods against an ablation ground truth: zero each head's contribution and measure the drop in logit margin over 80 labeled sentences.
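As a rough illustration of this ablation ground truth, the sketch below uses a toy model whose logits are a plain sum of per-head contributions, so zeroing a head just drops its term. The function names, the data layout, and the definition of margin as correct-class logit minus the best wrong-class logit are assumptions made for the example, not the code used in the experiments.

```python
from typing import List, Tuple

Vector = List[float]

def logit_margin(logits: Vector, label: int) -> float:
    """Correct-class logit minus the largest wrong-class logit."""
    best_wrong = max(v for i, v in enumerate(logits) if i != label)
    return logits[label] - best_wrong

def total_logits(contribs: List[Vector], skip: int = -1) -> Vector:
    """Sum per-head logit contributions, leaving out head `skip` (the ablation)."""
    n_classes = len(contribs[0])
    return [
        sum(c[j] for h, c in enumerate(contribs) if h != skip)
        for j in range(n_classes)
    ]

def ablation_importance(examples: List[Tuple[List[Vector], int]],
                        n_heads: int) -> Vector:
    """Mean drop in logit margin when each head is zeroed, over all examples.

    `examples` is a list of (per_head_contributions, label) pairs.
    """
    drops = [0.0] * n_heads
    for contribs, label in examples:
        base = logit_margin(total_logits(contribs), label)
        for h in range(n_heads):
            ablated = logit_margin(total_logits(contribs, skip=h), label)
            drops[h] += (base - ablated) / len(examples)
    return drops

# Toy data: 3 heads, 2 classes, 2 labeled sentences (made-up numbers).
examples = [
    ([[2.0, 0.0], [0.5, 0.5], [0.0, 1.0]], 0),
    ([[0.0, 3.0], [0.2, 0.2], [1.0, 0.0]], 1),
]
importance = ablation_importance(examples, n_heads=3)
```

In this toy data head 0 carries most of the margin, head 1 is neutral, and head 2 has negative importance: removing it raises the margin.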

The results were unexpectedly clean:

| Method | Spearman r vs ablation | Mean retention (k=5 to 30) |
|---|---|---|
| OAN (output activation norm) | +0.633 | 99.0% |
| TAM (topology hub score) | +0.132 | 91.2% |
| Random | +0.092 | 92.1% |
| Taylor (gradient x activation) | -0.207 | 68.7% |
| AttNorm | -0.378 | 65.8% |
| EntInv | -0.346 | 70.1% |

Two things stood out. First, Output Activation Norm (OAN) -- the mean L1 magnitude of each head's pre-projection output vector -- achieved r = +0.63 against ablation ground truth with zero gradients and zero labels. One forward pass per sentence, one hook on the output projection layer. That's the whole method.
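A minimal sketch of the OAN computation, assuming the hook has already captured the input to the output projection as one concatenated vector per token, with each head occupying a contiguous slice. The function name `oan_scores`, the pooling over tokens, and the toy activations are illustrative assumptions; the exact averaging used in the experiments may differ.

```python
from typing import List

def oan_scores(pre_proj: List[List[float]], n_heads: int) -> List[float]:
    """Mean L1 norm of each head's slice of the pre-projection vector.

    `pre_proj` holds one vector of length n_heads * head_dim per token,
    pooled over all sentences. No gradients or labels are involved.
    """
    width = len(pre_proj[0])
    if width % n_heads != 0:
        raise ValueError("vector length must be divisible by n_heads")
    head_dim = width // n_heads
    totals = [0.0] * n_heads
    for vec in pre_proj:
        for h in range(n_heads):
            chunk = vec[h * head_dim:(h + 1) * head_dim]
            totals[h] += sum(abs(x) for x in chunk)
    return [t / len(pre_proj) for t in totals]

# Two tokens, 2 heads of dimension 2 (made-up activations).
captured = [
    [1.0, -1.0, 0.5, 0.0],
    [3.0, 0.0, -0.5, 0.5],
]
scores = oan_scores(captured, n_heads=2)
```

Here head 0 scores higher than head 1 because its slice has the larger average absolute magnitude.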

Second, the Taylor first-order criterion -- the standard gradient-based approach for attention head pruning -- was inverted: r = -0.21. It identified the most important heads as least important. At k=30 (42% of heads pruned), following Taylor's recommendation left only 53.5% of logit margin. Random pruning retained 84.5%.
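The retention figures can be read as follows, under the assumption that retention at budget k means the logit margin left after pruning the k heads a method ranks lowest, as a percentage of the unpruned margin. The sketch uses a toy additive margin in place of a real model; `toy_margin`, `true_contrib`, and both rankings are hypothetical.

```python
from typing import Callable, List, Set

def lowest_k(scores: List[float], k: int) -> Set[int]:
    """Indices of the k heads a method ranks as least important."""
    order = sorted(range(len(scores)), key=lambda h: scores[h])
    return set(order[:k])

def retention(scores: List[float], k: int,
              margin_fn: Callable[[Set[int]], float]) -> float:
    """Percent of the unpruned logit margin left after pruning k heads."""
    baseline = margin_fn(set())
    return 100.0 * margin_fn(lowest_k(scores, k)) / baseline

# Toy stand-in for the model: the margin is a sum of fixed per-head
# contributions, one of which is negative (a harmful head).
true_contrib = [4.0, 3.0, 2.0, 1.0, -2.0]

def toy_margin(pruned: Set[int]) -> float:
    return sum(c for h, c in enumerate(true_contrib) if h not in pruned)

good_ranking = [4.0, 3.0, 2.0, 1.0, -2.0]      # agrees with true contribution
inverted_ranking = [-s for s in good_ranking]  # ranks the best heads lowest

r_good = retention(good_ranking, 2, toy_margin)
r_bad = retention(inverted_ranking, 2, toy_margin)
```

With these numbers the good ranking retains more than 100% of the margin, because one of the pruned heads was hurting it, while the inverted ranking removes the two strongest heads and retains very little.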

Why Taylor inverts in converged models

A gradient decomposition revealed the mechanism. Computed separately:

  • r(gradient magnitude, ablation importance) = -0.47
  • r(activation magnitude, ablation importance) = +0.74

The activation signal is strongly positive. The gradient signal is strongly inverted. Their product -- Taylor -- inherits a net negative signal.
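The decomposition can be sketched as three rank correlations against the same ablation scores: one for gradient magnitude, one for activation magnitude, one for their product. The Spearman helper below is a plain-Python implementation with average ranks for ties, and the per-head numbers are made up. Treating Taylor as the product of the two magnitudes is a simplification of the usual first-order criterion.

```python
from typing import List

def ranks(values: List[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            out[idx] = avg
        i = j + 1
    return out

def pearson(x: List[float], y: List[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x: List[float], y: List[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up per-head values, ordered from most to least important head.
ablation = [5.0, 4.0, 3.0, 2.0, 1.0]
act_mag = [2.0, 1.8, 1.5, 1.2, 1.0]       # rises with importance
grad_mag = [0.01, 0.02, 0.05, 0.2, 0.9]   # falls with importance
taylor = [g * a for g, a in zip(grad_mag, act_mag)]

r_act = spearman(act_mag, ablation)
r_grad = spearman(grad_mag, ablation)
r_taylor = spearman(taylor, ablation)
```

In this toy data the gradient magnitudes span a much wider range than the activations, so the product's ranking follows the gradient and comes out negative. The sketch only shows that such an inversion is possible, not that this is the exact mechanism in the real models.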

The explanation is straightforward once you see it. Taylor was designed for pruning during training, where large gradient magnitude means the optimizer is actively adjusting a parameter -- it is sensitive, it matters. In a converged fine-tuned model, this logic inverts. Important heads have settled to a local minimum: the loss is no longer sensitive to small perturbations of their output. Their gradients are near zero. Unimportant, residually unstable heads still have large gradients -- they haven't settled yet. Taylor measures optimization drift, not output contribution. OAN measures output contribution directly.
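One way to write the argument down, with notation that is an assumption of this sketch rather than the post's own: let a_h be the output vector of head h, g_h the gradient of the loss with respect to a_h, and H_h the corresponding Hessian block. Ablating the head replaces a_h with zero.

```latex
% Second-order expansion of the loss change caused by ablating head h
\Delta L_h \;=\; L(a_h \to 0) - L(a_h)
          \;\approx\; -\,g_h^{\top} a_h \;+\; \tfrac{1}{2}\, a_h^{\top} H_h\, a_h

% The first-order Taylor score keeps only the first term
T_h \;=\; \bigl|\, g_h^{\top} a_h \,\bigr|
    \;\le\; \lVert g_h \rVert_2 \, \lVert a_h \rVert_2

% Near convergence g_h \to 0, so T_h \to 0, while the second-order term
% \tfrac{1}{2}\, a_h^{\top} H_h\, a_h can remain large for a head that matters.
```

Under this reading, a small Taylor score in a converged model says the gradient is small, not that removing the head is harmless.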

This is not an implementation problem. We verified that the hooks fire on all 72 heads and that gradient magnitudes are non-zero (mean 0.00023, coefficient of variation 0.554). The inversion is real.

Experiment 2: BERT-base-SST2 (144 heads, same task, different architecture)

Replication on textattack/bert-base-uncased-SST-2: 12 layers x 12 heads per layer, 144 heads total. The hook path was adapted from out_lin (DistilBERT) to attention.output.dense (BERT). Everything else was identical.
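The path change can be captured in a small lookup, sketched below. The full dotted paths follow the Hugging Face module naming as far as we know it, but they are an assumption here and worth checking against `model.named_modules()`. `resolve`, `out_proj_modules`, and the stand-in object are hypothetical helpers so the example runs without loading a model.

```python
from types import SimpleNamespace

# Output-projection module paths, relative to the classifier model.
# Assumed from Hugging Face naming; verify before relying on them.
OUT_PROJ_PATH = {
    "distilbert": "distilbert.transformer.layer.{i}.attention.out_lin",
    "bert": "bert.encoder.layer.{i}.attention.output.dense",
}

def resolve(root, dotted: str):
    """Walk a dotted path, treating numeric parts as list indices."""
    obj = root
    for part in dotted.split("."):
        obj = obj[int(part)] if part.isdigit() else getattr(obj, part)
    return obj

def out_proj_modules(model, arch: str, n_layers: int):
    """One output-projection module per layer, in layer order."""
    template = OUT_PROJ_PATH[arch]
    return [resolve(model, template.format(i=i)) for i in range(n_layers)]

# Stand-in object that mimics the BERT attribute layout (no real weights).
def fake_layer(i: int) -> SimpleNamespace:
    output = SimpleNamespace(dense=f"dense_{i}")
    return SimpleNamespace(attention=SimpleNamespace(output=output))

fake_bert = SimpleNamespace(
    bert=SimpleNamespace(
        encoder=SimpleNamespace(layer=[fake_layer(i) for i in range(12)])
    )
)
mods = out_proj_modules(fake_bert, "bert", n_layers=12)
```

With a real PyTorch model, each returned module could take a forward pre-hook (`register_forward_pre_hook`) to capture the pre-projection input that OAN is computed from.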

| Model | OAN Spearman r | Mean retention |
|---|---|---|
| DistilBERT-SST2 | +0.633 | 99.0% |
| BERT-base-SST2 | +0.491 | 99.9% |

OAN retention is essentially identical: 99.0% vs 99.9%, despite double the heads and a different architecture. Taylor's Spearman r on BERT-base is +0.117 -- near-random rather than inverted. That is consistent with the convergence-depth theory: the textattack BERT-base model may be less deeply converged than the official HuggingFace DistilBERT fine-tune.

Experiment 3: BERT-base-AG-News (144 heads, 4-class topic classification)

The cross-task test: same BERT-base architecture, same textattack model family, but a different task -- 4-class news categorization (World, Sports, Business, Science/Technology).

OAN Spearman r = -0.038. Near zero. Taylor: -0.169. No method reliably ranked heads by importance.

One striking observation: TAM and AttNorm showed retention above 100% at aggressive pruning budgets (118% for TAM at k=40, 119% for AttNorm). Pruning the lowest-ranked heads by those metrics improved the model's logit margin. The AG News model contains heads that are actively harmful -- a regime absent from the two SST-2 models, both of which were 100% accurate on our test sentences.

Experiment 4: Per-class OAN on AG News

The obvious hypothesis: cross-category averaging destroys the signal. Compute OAN and ablation using only World-news sentences, then only Sports sentences, and so on.

Per-class mean OAN Spearman: -0.036. Essentially identical to the cross-class result.

  • World: +0.093
  • Sports: -0.090
  • Business: -0.089
  • Sci/Tech: -0.059

The failure is not caused by averaging across heterogeneous categories. OAN does not predict head importance within AG News even for a single category evaluated in isolation.
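The per-class procedure amounts to grouping sentences by label before averaging, then correlating within each group. The sketch below shows only the grouping step on made-up per-sentence values, plus an arithmetic check that the four reported correlations average to the stated mean. `per_class_head_means` is a hypothetical helper, not the experiment code.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

def per_class_head_means(per_sentence: List[List[float]],
                         labels: List[str]) -> Dict[str, List[float]]:
    """Average a per-sentence, per-head quantity within each class label."""
    grouped = defaultdict(list)
    for row, label in zip(per_sentence, labels):
        grouped[label].append(row)
    return {
        label: [mean(col) for col in zip(*rows)]
        for label, rows in grouped.items()
    }

# Made-up per-sentence OAN values for 2 heads and 4 sentences.
per_sentence_oan = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
labels = ["World", "Sports", "World", "Sports"]
by_class = per_class_head_means(per_sentence_oan, labels)

# The per-class Spearman values reported above, and their mean.
reported_r = {"World": 0.093, "Sports": -0.090,
              "Business": -0.089, "Sci/Tech": -0.059}
mean_r = round(mean(reported_r.values()), 3)
```

The same grouping would be applied to the ablation scores, so that each class gets its own pair of per-head vectors to correlate.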

The boundary condition: what we don't know

The question is now: why does OAN work on sentiment (both architectures) and fail on topic classification?

Two candidate explanations are currently confounded:

Task structure (binary vs. multi-class): Binary classification requires a small set of heads to activate consistently for all examples in one direction. Multi-class classification can distribute the task -- different heads specialize for different category vocabularies. No single head needs high output magnitude across all categories.

Convergence depth: Both SST-2 models achieved 100% accuracy on our 80 test sentences with high margins (7-8). The AG News model achieved 90% with lower margins (4.1). Less convergence might mean more distributed head roles and a weaker OAN signal.

A controlled experiment would help separate the two: train a binary topic classifier (World vs. Sports) on the same model architecture and test OAN. If OAN recovers on a binary topic task, binary structure is the key variable. If it doesn't, binary structure alone is not sufficient, and convergence depth becomes the leading candidate.

That experiment is pending.

What is established

Across all five runs, three models, two tasks:

Established:

  • OAN is a strong predictor for binary sentiment classification (r = +0.49 to +0.63, 99%+ retention)
  • Taylor fails across all three models and both tasks (inverted, near-random, or weakly inverted)
  • The gradient decomposition explains Taylor failure: activation magnitude is positive, gradient magnitude is negative, their product inherits the inversion
  • OAN does not predict head importance for 4-class topic classification (per-class confirmed)

Not established:

  • OAN as a universal cross-task predictor
  • Whether binary structure or convergence depth is the active variable in the boundary condition

Connection to IRDME's core work

The TAM result -- topology (attention pattern hub scores) achieves r = +0.13 at the head level but r = +0.57 at the layer level -- is a direct confirmation of FPL in transformer attention space. Structural centrality localizes task importance to the right layers (later layers in DistilBERT carry SST-2 signal). But within a layer, output magnitude (OAN) predicts which specific head matters, not topology.

This is consistent with the measurement-proximity principle visible across IRDME domains: the closer your metric is to the actual functional output of a system node, the better it predicts that node's importance. Topology gives you the coarse structural map. Direct measurement gives you the fine-grained answer.

All code is available at github.com/vladi160/the-beginning/topology-model/. The paper (exploratory, not pre-registered) is in preparation.