
The Gap Between Spearman and Pearson Tells You Something About the Structure, Not the Metric

We reanalyzed all 45 cross-layer hub correlation results across 33 pre-registered IRDME experiments to compare Spearman rho and Pearson r. Overall, Pearson is marginally higher on average (mean |r| = 0.510 vs mean |rho| = 0.463). But the gap between them (Spearman rho minus Pearson r) is not noise -- it is a structural diagnostic. Large positive gaps (Spearman >> Pearson) appear in cross-species and cross-domain settings where rank ordering is preserved but magnitudes differ. Large negative gaps (Pearson >> Spearman) appear in within-domain settings where absolute hub centrality scores carry real structural signal. The sign of the gap tells you what kind of cross-layer relationship you are measuring.

The Question

Earlier work (F13, Drosophila larval connectome) showed a striking discrepancy: Spearman rho = 0.663 against Pearson r = 0.363, nearly double. This raised a methodological question (M1 in the research log): is FPL actually about rank preservation (Spearman) rather than magnitude correlation (Pearson)? If Spearman were systematically higher, the paper's Pearson-based effect sizes would be understated.

We ran a systematic reanalysis across all pre-registered experiments to find out.

The Data

45 cross-layer hub correlation results from 33 pre-registered experiments, spanning: neuroscience (connectomes), software (OSS, COBOL, Flask), formal mathematics (Lean4, Coq), physics (Standard Model), chemistry (prebiotic networks), medicine (evidence chains), hardware (ISCAS85 circuits), and machine learning (architecture graphs).

For each result, we extracted both Pearson r and Spearman rho from the stored outputs.
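
As a reminder of what the two numbers measure, here is a minimal pure-Python sketch of both coefficients. It is not the pipeline used for the experiments: the function names and the five-node centrality vectors are hypothetical, chosen only for illustration. Spearman rho is computed as Pearson r on the ranks, with tied values sharing their average rank.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the full run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical centrality scores for the same five nodes in two layers.
layer_a = [0.90, 0.40, 0.35, 0.10, 0.05]
layer_b = [0.85, 0.20, 0.30, 0.15, 0.02]

r = pearson_r(layer_a, layer_b)
rho = spearman_rho(layer_a, layer_b)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

In practice a library routine such as `scipy.stats.pearsonr` and `scipy.stats.spearmanr` would normally be used instead; the hand-written version is here only to make the rank transformation explicit.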

What We Found

    The main result: Spearman is NOT systematically higher.
  • Spearman > Pearson: 15/45 cases (33%)
  • Spearman < Pearson: 25/45 cases (56%)
  • Neither of the above: 5/45 cases (11%)
  • Average |Pearson|: 0.510
  • Average |Spearman|: 0.463

Pearson is marginally higher on average. The original concern -- that effect sizes across the board are understated -- is not supported.

But the gap between the two metrics is not random noise. It has a clear structural pattern.

The Pattern: Gap Sign Maps to Relationship Type

    Cases where Spearman substantially exceeds Pearson:
  • F13 (C. elegans -> Drosophila cross-species hub transfer): Spearman=0.663, Pearson=0.363, gap=+0.30
  • M_MED3 (opioid evidence chain): Spearman=0.376, Pearson=0.144, gap=+0.23
  • M_TRANSFER_3 (cross-language software): Spearman=0.224, Pearson=0.000, gap=+0.22
    Cases where Pearson substantially exceeds Spearman:
  • M_PHYSICS_1/3 (Standard Model force/decay): Pearson=0.569, Spearman=0.109, gap=-0.46
  • C. elegans 302-neuron connectome (within-species): Pearson=0.623, Spearman=0.313, gap=-0.31
  • Computational geometry (M_GEOM_CSG_1): Pearson=0.668, Spearman=0.413, gap=-0.26
  • AI architecture lineage: Pearson=0.915, Spearman=0.676, gap=-0.24
    Across these cases the pattern is consistent:
  • Spearman >> Pearson appears in cross-species, cross-domain, and cross-paradigm comparisons -- contexts where the RANK ORDER of hub importance is preserved across scale differences, but the actual magnitudes differ between the two systems being compared
  • Pearson >> Spearman appears in within-domain, within-species, and within-paradigm comparisons -- contexts where the absolute magnitude of hub centrality scores carries structural information beyond rank
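
The gap values quoted above follow the convention gap = Spearman rho minus Pearson r. The sketch below recomputes them from the reported coefficients. The `setting` labels simply mirror the two lists above ("cross" for the Spearman >> Pearson cases, "within" for the Pearson >> Spearman cases); the variable and function names are illustrative, not taken from the analysis code.

```python
# (name, setting, Pearson r, Spearman rho) as reported in the lists above.
cases = [
    ("F13", "cross", 0.363, 0.663),
    ("M_MED3", "cross", 0.144, 0.376),
    ("M_TRANSFER_3", "cross", 0.000, 0.224),
    ("M_PHYSICS_1/3", "within", 0.569, 0.109),
    ("C. elegans 302-neuron", "within", 0.623, 0.313),
    ("M_GEOM_CSG_1", "within", 0.668, 0.413),
    ("AI architecture lineage", "within", 0.915, 0.676),
]

def gap(pearson_r, spearman_rho):
    """Signed gap: positive when Spearman exceeds Pearson."""
    return spearman_rho - pearson_r

rows = []
for name, setting, r, rho in cases:
    g = gap(r, rho)
    rows.append((name, setting, g))
    print(f"{name:<24} {setting:<7} r={r:.3f} rho={rho:.3f} gap={g:+.3f}")
```

For these seven cases the sign of the gap lines up with the grouping, which is the pattern described above; it says nothing about the other 38 results, which are not listed here.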

What This Means

The Spearman-Pearson gap is a structural diagnostic, not a metric preference question.

When you measure hub correlation between two layers of the same system (same organism, same codebase, same particle physics model), the magnitudes are comparable. A node that is rank #1 with centrality score 0.9 in one layer and 0.85 in another carries information in those magnitudes: the persistence is strong, and Pearson captures that. Spearman, by discarding magnitudes, loses signal.

When you measure hub correlation across systems separated by a large scale difference (600 million years of evolution, different programming languages, different evidence networks), the magnitudes are not directly comparable -- a neuron that is important in a 302-node worm connectome and a neuron filling the same topological role in a 2952-node fly connectome will not have the same raw centrality scores. What persists is the rank. Spearman is better matched to that type of preservation.

This is not a claim that Spearman is the "correct" metric in cross-domain settings. That would require a counterfactual test -- does Spearman predict missing links better? Does it generalise to held-out data? We have not run those tests. What we can say is: the gap is consistent with rank-preserving structure in cross-domain settings, and consistent with magnitude-informative structure in within-domain settings.
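
Both regimes are easy to reproduce on synthetic data, which at least shows that the interpretation is mechanically possible. The sketch below is an illustration under stated assumptions, not a reanalysis: all data are made up. Regime A applies a monotone but nonlinear rescaling between layers (rank preserved, magnitudes distorted). Regime B pairs one dominant hub whose magnitude agrees across layers with a low-centrality bulk whose ranks are shuffled.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; assumes no tied values (true for the data below)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for position, index in enumerate(order, start=1):
        out[index] = position
    return out

def spearman_rho(x, y):
    """Spearman correlation as Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Regime A (synthetic): layer 2 is an exponential rescaling of layer 1,
# so the rank order is identical but the magnitudes are far from proportional.
x_a = list(range(1, 21))
y_a = [math.exp(v / 2) for v in x_a]
r_a, rho_a = pearson_r(x_a, y_a), spearman_rho(x_a, y_a)

# Regime B (synthetic): eight low-centrality nodes with shuffled ranks,
# plus one hub (last entry) whose large score agrees across layers.
x_b = [0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.95]
y_b = [0.13, 0.10, 0.16, 0.11, 0.17, 0.12, 0.15, 0.14, 0.90]
r_b, rho_b = pearson_r(x_b, y_b), spearman_rho(x_b, y_b)

print(f"Regime A: r={r_a:.3f} rho={rho_a:.3f} gap={rho_a - r_a:+.3f}")
print(f"Regime B: r={r_b:.3f} rho={rho_b:.3f} gap={rho_b - r_b:+.3f}")
```

Regime A produces a positive gap and Regime B a negative one. These are only two of the ways such gaps can arise, so matching signs in real data do not by themselves establish that either mechanism is at work.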

The Upgrade Enabled

    For two of the cases where the gap is positive and substantial, using Spearman as primary changes how the result reads:
  • F13 h2 (cross-species hub persistence, larval fly): Pearson r=0.363 (PARTIAL) -> Spearman rho=0.663 (CONFIRMED). The Spearman result is more appropriate here: we are asking whether hub rank is preserved across 600 million years of evolution, not whether the numerical centrality scores are proportional.
  • M_MED3 h1 (opioid evidence chain FPL direction): Pearson r=0.144 (PARTIAL) -> Spearman rho=0.376. The evidence chain nodes are ordered epistemically (justification rank, citation rank) -- rank comparison is natural.
    We report both metrics. The practical recommendation for the FPL paper:
  • Report Pearson r as primary throughout (stronger on average, appropriate for the within-domain majority of experiments)
  • Report Spearman rho in a supplementary table for all experiments
  • For cross-species experiments (F13), note explicitly that Spearman shows substantially higher values (rho=0.663 vs r=0.363), consistent with rank-preserving hub structure under inter-species scale differences
  • Do not claim one metric is universally "right" -- the gap itself is the finding

What This Does Not Claim

    We have not shown:
  • That Spearman predicts missing links or out-of-sample data better in cross-species settings
  • That the gap threshold (what counts as "large"?) is precisely defined
  • That the within-domain / cross-domain distinction is the only factor driving the gap

The diagnostic interpretation is consistent with the data, but it is a hypothesis about the mechanism, not a validated theory. Testing it properly would require: (1) a held-out prediction task where the two metrics give different rankings, and (2) showing that the metric with the larger value produces better predictions in the relevant setting.

Summary

    Across 45 cross-layer correlation measurements in 33 pre-registered experiments:
  • Pearson r is marginally higher on average (0.510 vs 0.463). Effect sizes are not systematically understated.
  • The Spearman-Pearson gap is not noise. In the cases examined, positive gaps appear in cross-domain/cross-species settings and negative gaps appear in within-domain settings.
  • The gap is a structural diagnostic: it reflects whether rank ordering or magnitude is the more stable property across the two layers being compared.
  • Report both. The gap itself carries information.