Ghost Programs: Topological Dormancy Signatures in COBOL Legacy Code
F2_cobol_legacy_v1: IRDME on a 14-program COBOL banking application finds r(PERFORM↔COPY)=0.807 (p=0.002) vs r(PERFORM↔data_field_sharing)=0.119 - the Functional Proximity Law holds in 50-year-old procedural mainframe code. Two structurally dormant programs appear as rank #2-#3 hubs in data_field_sharing despite zero PERFORM calls. Rank gap=11 each. Multilayer rank divergence identifies components that are operationally peripheral in execution topology yet central in shared-state topology.
The problem with COBOL is not what you think
An estimated 240 billion lines of COBOL are in production, processing trillions of dollars in financial transactions annually. The standard account of why legacy COBOL systems are difficult to modernize focuses on programmer scarcity, system age, and missing documentation.
The structural account is different: legacy COBOL systems have a specific architectural pattern that makes certain programs invisible to normal analysis tools while remaining deeply embedded in the codebase's data fabric.
Experiment F2_cobol_legacy_v1 is the first IRDME analysis on a COBOL mainframe application. The finding is not about dead code. It is about a structural divergence signature that identifies execution-isolated programs through topology alone.
Three layers of COBOL coupling
The dataset: 14 COBOL programs from a representative batch-processing banking system. Three relational layers capture three fundamentally different kinds of coupling:
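d1 - perform_call_graph (17 edges): Program A explicitly PERFORMs (calls) program B. This is the declared execution structure of a COBOL application - the hierarchy of control flow. (In standard COBOL, PERFORM transfers control to paragraphs or sections within a program and CALL invokes a separate program; this layer uses PERFORM as its label for program-level invocation.) A program that no other program in the dataset PERFORMs has no active callers within the graph.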
d2 - copy_dependency (25 edges): Programs A and B both include the same COPY copybooks - shared record layout definitions (ACCT-REC, TRANS-DATA, ERROR-CODES, AUDIT-TRAIL, HIST-REC). In COBOL, two programs that PERFORM each other must share at least one copybook, because COPY defines the data structures through which they communicate. This is the COBOL inter-program communication contract.
d3 - data_field_sharing (53 edges): Programs A and B both access the same WORKING-STORAGE (WS) data items. This is more diffuse than copybook dependency: COBOL's global WORKING-STORAGE allows any program to reference any field in the shared area, regardless of whether that program is ever PERFORMed. Programs that were once active but are no longer called retain their WS references.
Two programs in the dataset were designed to represent execution-isolated modules:
- legacy_interest_calc - original interest calculation routine, superseded by balance_calculator in a 1998 modernization
- dormant_account - dormant account processing, outsourced to a third-party system in 2015
Both programs have degree=0 in perform_call_graph. No program in the dataset PERFORMs them.
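The layer structure above can be sketched as a mapping from layer name to an edge list, with per-layer degree obtained by counting incident edges. This is a minimal sketch, not the IRDME implementation: the names `layers` and `layer_degree` are hypothetical, edges are assumed to be undirected, and only the six calls the article attributes to main_control are included because the other edges are not enumerated. Degrees for every program other than main_control and the two isolated programs are therefore incomplete here.

```python
from collections import Counter

# Subset of the 14 programs: only names that appear in the article.
PROGRAMS = [
    "main_control", "account_inquiry", "transaction_processor",
    "report_generator", "statement_formatter", "system_initializer",
    "archive_processor", "dormant_account", "legacy_interest_calc",
]

# Partial d1 layer: the six calls the article attributes to main_control.
# The remaining 11 of the 17 edges are not enumerated, so they are omitted.
layers = {
    "perform_call_graph": [
        ("main_control", callee)
        for callee in (
            "account_inquiry", "transaction_processor", "report_generator",
            "statement_formatter", "system_initializer", "archive_processor",
        )
    ],
}


def layer_degree(edges, programs):
    """Count incident edges per program, treating edges as undirected (assumption)."""
    counts = Counter()
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    # Programs with no edges get an explicit 0 instead of being absent.
    return {p: counts[p] for p in programs}


d1_degree = layer_degree(layers["perform_call_graph"], PROGRAMS)
print(d1_degree["main_control"], d1_degree["dormant_account"])
```

Returning an explicit zero for unconnected programs matters for this analysis: the execution-isolated programs must stay in the degree vector, otherwise they cannot be ranked against the other layers.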
Pre-registration
Four hypotheses were pre-registered before any analysis:
- h1: r(perform_call_graph ↔ copy_dependency) > r(perform_call_graph ↔ data_field_sharing)
- h2: main_control is rank #1 hub in perform_call_graph
- h3: dormant_account is in the top 3 by degree in data_field_sharing
- h4: r(perform_call_graph ↔ copy_dependency) ≥ 0.50 with p < 0.05
Hash c119983a, committed to github.com/vladi160/preregistrations before analysis ran.
Results: 4/4 CONFIRMED
h1 - FPL (Functional Proximity Law) in COBOL: CONFIRMED.
- r(perform_call_graph ↔ copy_dependency): Pearson=0.807, Spearman=0.840, permutation p=0.002
- r(perform_call_graph ↔ data_field_sharing): Pearson=0.119 (near-zero, not significant)
- FPL gradient Δr = 0.688
COBOL's global WORKING-STORAGE creates the expected diffuseness in d3: data access patterns are less discriminating than explicit COPY dependencies, because any program can access any field. The 0.807 vs 0.119 split is the structural consequence of this architectural choice.
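A layer-to-layer correlation with a permutation p-value can be sketched as follows. The article does not say whether r is computed over per-program degree vectors or over adjacency entries, nor how many permutations were used; this sketch assumes degree vectors and a two-sided test, and the two input vectors are hypothetical toy values, not the experiment's data.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def permutation_p(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value: shuffle y, count |r| >= observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if abs(pearson(x, y_perm)) >= observed:
            hits += 1
    # The +1 terms count the observed arrangement itself, so p is never 0.
    return (hits + 1) / (n_perm + 1)


# Hypothetical per-program degrees in two layers (illustration only).
toy_control = [6, 5, 4, 3, 2, 2, 1, 1, 0, 0]
toy_contract = [7, 6, 4, 4, 3, 2, 2, 1, 1, 0]

r = pearson(toy_control, toy_contract)
p = permutation_p(toy_control, toy_contract)
print(f"r={r:.3f}, p={p:.4f}")
```

Shuffling one vector breaks the pairing between programs while preserving each layer's degree distribution, which is why the permutation null suits a small sample such as 14 programs.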
h2 - main_control as hub: CONFIRMED. main_control has degree=6 in perform_call_graph, rank #1. The driver program orchestrates the batch run - it directly PERFORMs account_inquiry, transaction_processor, report_generator, statement_formatter, system_initializer, and archive_processor.
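h3 - dormant_account in the d3 top 3: CONFIRMED. dormant_account has degree=10 in data_field_sharing, rank #2. The rank structure behind this result is examined below.
h4 - correlation threshold: CONFIRMED. r(perform_call_graph ↔ copy_dependency) = 0.807 ≥ 0.50, with permutation p = 0.002 < 0.05.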
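The four pre-registered predicates can be written out as explicit checks against the reported values. This is a bookkeeping sketch only: the dictionary keys are hypothetical names, and the numbers are copied from the article, not recomputed from the dataset.

```python
# Values as reported in the article (not recomputed here).
reported = {
    "r_d1_d2": 0.807,              # perform_call_graph vs copy_dependency
    "p_d1_d2": 0.002,              # permutation p-value for that correlation
    "r_d1_d3": 0.119,              # perform_call_graph vs data_field_sharing
    "d1_rank_main_control": 1,     # rank by degree in perform_call_graph
    "d3_rank_dormant_account": 2,  # rank by degree in data_field_sharing
}

checks = {
    "h1": reported["r_d1_d2"] > reported["r_d1_d3"],
    "h2": reported["d1_rank_main_control"] == 1,
    "h3": reported["d3_rank_dormant_account"] <= 3,
    "h4": reported["r_d1_d2"] >= 0.50 and reported["p_d1_d2"] < 0.05,
}

# FPL gradient: difference between the two layer correlations.
delta_r = round(reported["r_d1_d2"] - reported["r_d1_d3"], 3)

for name, passed in checks.items():
    print(name, "CONFIRMED" if passed else "NOT CONFIRMED")
print("delta_r =", delta_r)
```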
The structural dormancy finding
The most important result is not the FPL confirmation. It is the rank structure of the data_field_sharing layer.
Rankings in d3 (data_field_sharing):
- main_control: degree=11
- dormant_account: degree=10
- legacy_interest_calc: degree=9
- report_generator: degree=9
- account_inquiry: degree=8
Rankings in d1 (perform_call_graph):
- main_control: degree=6
- transaction_processor: degree=5
- account_inquiry: degree=4 ...
- dormant_account: degree=0
- legacy_interest_calc: degree=0
Cross-layer divergence:
- dormant_account: d1 rank=#13 → d3 rank=#2, rank gap = 11
- legacy_interest_calc: d1 rank=#14 → d3 rank=#3, rank gap = 11
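Both programs occupy the last positions in perform_call_graph and near-first positions in data_field_sharing. A rank gap of 11 is close to the largest gap possible in a 14-node graph, which is 13 (rank #14 in one layer, rank #1 in the other). Note that legacy_interest_calc ties with report_generator at degree=9 in d3, so its #3 position depends on how ties are broken.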
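The rank-gap computation can be sketched in a few lines. The article does not state how ties are ranked; this sketch assumes ordinal ranks with ties broken alphabetically by program name, which happens to reproduce the ordering reported above (dormant_account ahead of legacy_interest_calc, and legacy_interest_calc ahead of report_generator). The five-program degree tables are hypothetical toy values.

```python
def ordinal_ranks(degree):
    """Rank 1 = highest degree; ties broken alphabetically (an assumption)."""
    ordered = sorted(degree, key=lambda name: (-degree[name], name))
    return {name: position + 1 for position, name in enumerate(ordered)}


def rank_gaps(control_degree, data_degree):
    """Control-layer rank minus data-layer rank; large positive = diverger."""
    control_rank = ordinal_ranks(control_degree)
    data_rank = ordinal_ranks(data_degree)
    return {name: control_rank[name] - data_rank[name] for name in control_degree}


# Hypothetical five-program example (not the experiment's data).
toy_d1 = {"driver": 3, "posting": 2, "reporting": 1, "old_calc": 0, "old_batch": 0}
toy_d3 = {"driver": 4, "old_batch": 3, "old_calc": 3, "posting": 2, "reporting": 1}

gaps = rank_gaps(toy_d1, toy_d3)
for name, gap in sorted(gaps.items(), key=lambda item: -item[1]):
    print(f"{name}: gap={gap:+d}")
```

With the article's reported ranks the same subtraction gives 13 - 2 = 11 for dormant_account and 14 - 3 = 11 for legacy_interest_calc.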
What this is, precisely
A note on framing. The term "dead code" implies a runtime or static reachability claim - that no execution path can ever reach the program. IRDME does not make that claim.
More precisely, the analysis identified topological dormancy signatures: components that are operationally peripheral in execution topology yet central in shared-state topology.
Or in the language of the output: dormant_account and legacy_interest_calc are execution-isolated but dependency-central modules. They have no active callers in the dataset, but they remain structurally embedded in the data-sharing fabric of the entire application.
This distinction matters. A program could have degree=0 in a PERFORM call graph and still be:
- Called from a JCL job step not captured in the graph
- Invoked by a scheduler external to the dataset boundary
- Triggered by a CALL USING statement from outside the 14-program scope
What the topology establishes is the structural signature: zero in control-flow coupling, high in shared-state coupling. This pattern is consistent with programs that were once operationally active (they built up WS data dependencies when they were called) and have since been removed from the execution path without having their data references cleaned up.
In COBOL systems, this is structurally reliable as a candidate dormancy indicator precisely because WS references are not automatically removed when PERFORM calls are. The two layers decouple at different rates. The divergence is the footprint of a program that was once alive.
The architectural reason this works
In modern object-oriented languages, dead code is easier to detect through imports or type references because the dependency system is less global. In COBOL, global WORKING-STORAGE is the coupling medium. Every program can read and write every field.
This creates two effects that IRDME captures:
- Diffuse d3: r(perform_call_graph ↔ data_field_sharing) = 0.119, near-zero. Active programs and dormant programs are topologically similar in the data layer because anyone can access anything. The data layer has weak discriminating power about execution structure.
- Concentrated divergers: Because the data layer is diffuse but the control layer is sparse, programs that belong in the control layer but have been removed from it stand out as extreme rank-gap outliers. They show up as the nodes that are high in d3 but absent from d1.
This is structurally the same mechanism as the photon hub shadow in the Standard Model (M_PHYSICS_1), but inverted: where the photon is high in force_coupling and absent from decay_channel, the dormant COBOL programs are high in data_field_sharing and absent from perform_call_graph.
What this could be used for
The topological dormancy signature is a practical analysis pattern: run IRDME on a COBOL codebase with a perform_call_graph layer and a data_field_sharing layer, then identify programs with large positive rank gaps (high d3, low d1). These are candidates for:
- Modernization risk assessment: execution-isolated but data-central programs are migration hazards - removing them requires understanding their WS field usage even if they are never called
- Candidate dormancy identification: a human reviewer can check whether programs with this signature are intentionally inactive or accidentally excluded from the PERFORM chain
- Structural audit: the rank gap magnitude is a prioritization signal for technical debt review
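The screening step described above can be sketched as a filter over per-program rank pairs, ordered by gap magnitude. The `LayerRanks` type, the `review_queue` function, and the `min_gap=5` threshold are hypothetical choices for illustration; the four rows use ranks reported in this article (account_inquiry's ranks are read off the listing order), and the output is a list of candidates for human review, not a dead-code verdict.

```python
from typing import NamedTuple


class LayerRanks(NamedTuple):
    program: str
    d1_rank: int  # rank by degree in perform_call_graph
    d3_rank: int  # rank by degree in data_field_sharing


# Ranks as reported in the article; the other ten programs are not listed.
reported = [
    LayerRanks("main_control", 1, 1),
    LayerRanks("account_inquiry", 3, 5),
    LayerRanks("dormant_account", 13, 2),
    LayerRanks("legacy_interest_calc", 14, 3),
]


def review_queue(rows, min_gap=5):
    """Programs whose d1 rank trails their d3 rank by at least min_gap.

    Sorted by descending gap, then by name, so the largest divergers
    are reviewed first. min_gap is an arbitrary illustrative threshold.
    """
    flagged = [
        (row.d1_rank - row.d3_rank, row.program)
        for row in rows
        if row.d1_rank - row.d3_rank >= min_gap
    ]
    return [program for gap, program in sorted(flagged, key=lambda t: (-t[0], t[1]))]


queue = review_queue(reported)
print(queue)
```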
The experiment does not identify these programs as dead code. It identifies the structural conditions under which something like dead code leaves a detectable topological footprint.
Running the experiment
The dataset (cobol_legacy_banking.json, 14 programs, 3 layers, 95 relations) is available on irdme.com/datasets. The full pre-registration record is at github.com/vladi160/preregistrations.
F2 is the first IRDME experiment in the software/legacy domain. The architecture: a multilayer graph where the per-layer degree vector encodes a program's structural role. The FPL tells you which layers are structurally coherent with each other. The rank divergers tell you where that coherence breaks down.
The next experiment is F2_v2: running the same analysis on a real Open Mainframe Project COBOL codebase, where the perform_call_graph is extracted from actual source code rather than modeled from architectural conventions. If the topological dormancy signature replicates in a real codebase, the pattern generalizes beyond the model.