NeurIPS 2026 Submission

Linear Mode Connectivity Barriers
are Logit Linearization Gaps

Hanbin Bae

Under Review

Abstract

We give an exact pointwise decomposition of the cross-entropy barrier on the linear chord between any two classifiers. Writing $E(\alpha;x)$ for the per-example logit linearization error along $\gamma(\alpha)=(1{-}\alpha)\theta_A+\alpha\theta_B$,

$L(\gamma(\alpha))-(1{-}\alpha)L(\theta_A)-\alpha L(\theta_B) \;=\; \mathbb{E}_x\!\left[\,D - J - E_y\,\right]$

where $D$ is the LSE deviation, $J\ge 0$ an endpoint Jensen gap, and $E_y$ the true-label coordinate of $E$. The decomposition is purely algebraic—no Hessian, no projection, no alignment—and shows that the logit linearization error $E$ is the unique channel through which any barrier arises. Two one-line corollaries from LSE convexity bound $B_{\mathrm{lin}}$ by forward-pass functionals of $E$. We verify the decomposition on 36 endpoint pairs across six architecture classes (including ViT-Base and GPT-2) and demonstrate two mechanistic consequences: Hungarian weight matching reduces $B_{\mathrm{lin}}$ precisely because it shrinks $E$, and wide networks have smaller normalized barriers because $E$ vanishes in the near-linear regime. The decomposition also enables per-example barrier attribution: the top 35% of examples contribute half the raw barrier, and alignment makes the residual gap more concentrated while reshuffling which classes dominate it.

Key Results

The decomposition holds exactly across six architecture classes, 36 endpoint pairs, and over 2.5 orders of magnitude in barrier scale.

1.0005

Median identity ratio
(exact to 10⁻⁴)

36

Endpoint pairs verified
across 6 architectures

3.18×

Barrier reduction from
Hungarian alignment

14×

$B_{\mathrm{lin}}/L_2^2$ drop
in width sweep

Core Intuition

When a network is linear in its parameters, logits interpolate linearly and the barrier is zero. When it is nonlinear, the logit trajectory bows away from linear interpolation, producing $E\neq 0$—and the LSE convexity gap drives the barrier.

Conceptual diagram showing how logit linearization error E drives the LMC barrier

Figure 1. Top row (faded): when $f_\theta$ is linear in $\theta$, the logit trajectory equals the linear interpolation, so $E\equiv 0$ and $B_{\mathrm{lin}}=0$. Bottom row (vivid): when $f_\theta$ is nonlinear, the logit trajectory bows away, producing $E\neq 0$. The shaded red region is the barrier, exactly $\mathbb{E}_x[D-J-E_y]$.
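The contrast in Figure 1 can be reproduced on a toy problem. The sketch below is illustrative only: the two model functions, their shapes, and the random endpoints are assumptions made for the example, not the paper's setup. It evaluates $E(\alpha;x)=z_\gamma-z_{\mathrm{lin}}$ at the midpoint for a model that is linear in its parameters and for a small tanh MLP.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))  # 8 toy inputs with 5 features (hypothetical data)

def linear_logits(theta, x):
    # Logits linear in the parameters: z = x @ W, with W stored flat (5 x 3).
    return x @ theta.reshape(5, 3)

def mlp_logits(theta, x):
    # One hidden tanh layer (5 -> 4 -> 3); nonlinear in the parameters.
    w1 = theta[:20].reshape(5, 4)
    w2 = theta[20:].reshape(4, 3)
    return np.tanh(x @ w1) @ w2

def linearization_error(logits_fn, theta_a, theta_b, alpha, x):
    # E(alpha; x) = z_gamma - z_lin, one row per example.
    gamma = (1 - alpha) * theta_a + alpha * theta_b
    z_lin = (1 - alpha) * logits_fn(theta_a, x) + alpha * logits_fn(theta_b, x)
    return logits_fn(gamma, x) - z_lin

lin_a, lin_b = rng.normal(size=15), rng.normal(size=15)
mlp_a, mlp_b = rng.normal(size=32), rng.normal(size=32)

e_linear = linearization_error(linear_logits, lin_a, lin_b, 0.5, X)
e_mlp = linearization_error(mlp_logits, mlp_a, mlp_b, 0.5, X)

print(np.abs(e_linear).max())  # zero up to floating-point rounding
print(np.abs(e_mlp).max())     # generically nonzero
```

For the linear model the chord logits coincide with the interpolated logits, so every term of the decomposition vanishes; for the MLP the error is nonzero and a barrier becomes possible.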

The Decomposition Identity

The barrier decomposes into three forward-pass terms. No Hessian, no alignment, no projection is used. The identity holds as an algebraic equality—verified to $10^{-4}$ on every pair.

Proposition (Pointwise Identity). For any cross-entropy classifier, any two endpoints $\theta_A, \theta_B$, and every $\alpha\in[0,1]$:

$\displaystyle B(\alpha) = \mathbb{E}_{(x,y)}\!\bigl[\,D(\alpha;x) - J(\alpha;x) - E_y(\alpha;x)\,\bigr]$

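where $z_A$, $z_B$, and $z_\gamma$ denote the logits at $\theta_A$, $\theta_B$, and $\gamma(\alpha)$, $z_{\mathrm{lin}} = (1{-}\alpha)z_A + \alpha z_B$ is their linear interpolation, $D = \mathrm{LSE}(z_\gamma) - \mathrm{LSE}(z_{\mathrm{lin}})$, $J = (1{-}\alpha)\mathrm{LSE}(z_A) + \alpha\,\mathrm{LSE}(z_B) - \mathrm{LSE}(z_{\mathrm{lin}}) \ge 0$, and $E_y = [z_\gamma - z_{\mathrm{lin}}]_y$.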
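The identity follows in a few lines from writing the per-example cross-entropy as $\ell(z,y)=\mathrm{LSE}(z)-z_y$. The derivation below is a sketch; it takes $z_A$, $z_B$, $z_\gamma$ to be the logits at $\theta_A$, $\theta_B$, $\gamma(\alpha)$ and $z_{\mathrm{lin}}=(1{-}\alpha)z_A+\alpha z_B$, and the per-example symbol $b$ is introduced here for illustration.

```latex
\begin{align*}
b(\alpha;x,y)
 &= \ell(z_\gamma,y) - (1-\alpha)\,\ell(z_A,y) - \alpha\,\ell(z_B,y) \\
 &= \mathrm{LSE}(z_\gamma) - [z_\gamma]_y
    - (1-\alpha)\,\mathrm{LSE}(z_A) - \alpha\,\mathrm{LSE}(z_B)
    + [z_{\mathrm{lin}}]_y \\
 &= \underbrace{\mathrm{LSE}(z_\gamma) - \mathrm{LSE}(z_{\mathrm{lin}})}_{D}
    \;-\; \underbrace{\bigl[(1-\alpha)\,\mathrm{LSE}(z_A) + \alpha\,\mathrm{LSE}(z_B)
          - \mathrm{LSE}(z_{\mathrm{lin}})\bigr]}_{J}
    \;-\; \underbrace{[z_\gamma - z_{\mathrm{lin}}]_y}_{E_y}.
\end{align*}
```

The second line uses $(1-\alpha)[z_A]_y+\alpha[z_B]_y=[z_{\mathrm{lin}}]_y$; the third adds and subtracts $\mathrm{LSE}(z_{\mathrm{lin}})$. Taking the expectation over $(x,y)$ gives $B(\alpha)$, and $J\ge 0$ is Jensen's inequality for the convex function $\mathrm{LSE}$.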

Identity verification across five architecture classes

Figure 2. Five endpoint pairs from five architecture classes. Solid curves: measured chord barrier. Dashed curves: the forward-pass quantity $\mathbb{E}_x[D-E_y]$, which drops the nonnegative Jensen gap $J$ and therefore upper-bounds the barrier. The predicted curves overlay the measured barrier without any Hessian, alignment, or projection.
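The three terms need only logits from three forward passes. The following is a minimal sketch rather than the paper's code: the function names are hypothetical, and random arrays stand in for the logits at $\theta_A$, $\theta_B$, and $\gamma(\alpha)$. It checks that the measured barrier equals $\mathbb{E}[D-J-E_y]$ and sits below $\mathbb{E}[D-E_y]$.

```python
import numpy as np

def logsumexp(z):
    # Numerically stable log-sum-exp over the class axis.
    m = z.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(z - m).sum(axis=1, keepdims=True)))[:, 0]

def cross_entropy(z, y):
    # Per-example cross-entropy: LSE(z) - z_y.
    return logsumexp(z) - z[np.arange(len(y)), y]

def barrier_terms(z_a, z_b, z_gamma, y, alpha):
    # Per-example D, J, E_y from the three sets of logits.
    z_lin = (1 - alpha) * z_a + alpha * z_b
    d = logsumexp(z_gamma) - logsumexp(z_lin)
    j = (1 - alpha) * logsumexp(z_a) + alpha * logsumexp(z_b) - logsumexp(z_lin)
    e_y = (z_gamma - z_lin)[np.arange(len(y)), y]
    return d, j, e_y

# Stand-in logits (assumption: any three logit arrays satisfy the identity).
rng = np.random.default_rng(0)
n, k, alpha = 256, 10, 0.5
y = rng.integers(0, k, size=n)
z_a, z_b, z_gamma = (rng.normal(size=(n, k)) for _ in range(3))

measured = (cross_entropy(z_gamma, y)
            - (1 - alpha) * cross_entropy(z_a, y)
            - alpha * cross_entropy(z_b, y)).mean()
d, j, e_y = barrier_terms(z_a, z_b, z_gamma, y, alpha)
identity = (d - j - e_y).mean()   # matches `measured` up to rounding
upper = (d - e_y).mean()          # upper bound, since J >= 0

print(measured, identity, upper)
```

Because the identity is algebraic, it holds for arbitrary logit arrays; in practice the three arrays would come from evaluating the same batch at the two endpoints and at the chord point.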

Universality Across Architectures

The decomposition is tested on six architecture classes spanning vision and language: mod-$P$ transformers, sparse-parity MLPs, CIFAR-10 MLPs, ResNet-18, ViT-Base (85M), and GPT-2 (124M).

Architecture	Pairs	$B_{\mathrm{lin}}$ range (nats)	Identity ratio	$\|E\|_\infty$ ratio
Mod-$P$ Transformer	10	0.35 – 4.85	1.0002 – 1.0056	3.8 – 37.7×
Sparse-parity MLP	10	0.27 – 85.6	0.9998 – 1.0007	1.2 – 13.6×
CIFAR-10 MLP	5	2.16 – 2.31	1.028 – 1.039	23.2 – 24.8×
CIFAR-10 ResNet-18	5	3.18 – 4.19	1.0000	6.7 – 9.0×
Tiny ImageNet ViT-Base	3	0.25 – 0.33	1.186 – 1.255	9.8 – 11.9×
WikiText-2 GPT-2 (124M)	3	2.06 – 2.26	1.0000	5.1 – 5.4×

Scatter plot of forward-pass bounds vs measured barrier

Figure 3. (A) The upper-bound form lies near the diagonal (median ratio 1.0005). (B) The Lipschitz relaxation sits above the diagonal; tightest on binary-class tasks (~1.2×), loosest on CIFAR-10 MLPs (~25×).
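This page does not spell out the two bounds, so the following is one natural reading rather than the paper's exact statement. Dropping $J\ge 0$ gives the upper-bound form, and the fact that $\mathrm{LSE}$ is 1-Lipschitz in the sup norm gives a relaxation in terms of $\|E\|_\infty$ alone. The constant 2 comes from this derivation and may differ from the paper's normalization.

```latex
\begin{align*}
B(\alpha) &= \mathbb{E}_{(x,y)}\bigl[D - J - E_y\bigr]
  \;\le\; \mathbb{E}_{(x,y)}\bigl[D - E_y\bigr]
  && \text{since } J \ge 0, \\
D - E_y &\le |D| + |E_y| \;\le\; 2\,\|E\|_\infty
  && \text{since } |\mathrm{LSE}(u) - \mathrm{LSE}(v)| \le \|u - v\|_\infty, \\
B(\alpha) &\le 2\,\mathbb{E}_x\,\|E(\alpha;x)\|_\infty .
\end{align*}
```

The first bound loses only the Jensen gap, which is consistent with panel (A) lying near the diagonal. The second discards sign information in $E$, which is one reason a relaxation of this kind can be loose when there are many classes.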

Permutation Alignment Shrinks $E$

Hungarian weight matching reduces the barrier by 3.18× on CIFAR-10 MLPs. The decomposition tracks both sides: alignment works precisely because it shrinks the logit linearization error $E$.

Alignment effect on barrier and logit error

Figure 4. (A) Loss profile along the chord for raw (red) and aligned (blue) pairs. (B) Mean sup-norm $\mathbb{E}_x\|E\|_\infty$ along the chord; alignment reduces it uniformly. (C) The forward-pass identity overlays the measured barrier on both versions.
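To make the alignment step concrete, here is a minimal weight-matching sketch for a one-hidden-layer ReLU MLP using SciPy's `linear_sum_assignment`. The layer layout, the similarity measure (inner products of per-unit weight vectors), and all function names are assumptions chosen for illustration; the paper's matching procedure may differ in detail.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mlp_logits(params, x):
    # One hidden ReLU layer: params = (w1, b1, w2).
    w1, b1, w2 = params
    return np.maximum(x @ w1 + b1, 0.0) @ w2

def align_hidden_units(params_a, params_b):
    # Permute B's hidden units to best match A's (weight matching).
    w1_a, b1_a, w2_a = params_a
    w1_b, b1_b, w2_b = params_b
    # One feature vector per hidden unit: incoming weights, bias, outgoing weights.
    feat_a = np.concatenate([w1_a.T, b1_a[:, None], w2_a], axis=1)
    feat_b = np.concatenate([w1_b.T, b1_b[:, None], w2_b], axis=1)
    similarity = feat_a @ feat_b.T  # rows: units of A, columns: units of B
    _, perm = linear_sum_assignment(similarity, maximize=True)
    return w1_b[:, perm], b1_b[perm], w2_b[perm, :]

def distance(p, q):
    # Euclidean distance between two parameter tuples.
    return np.sqrt(sum(((u - v) ** 2).sum() for u, v in zip(p, q)))

# Toy endpoints: B is a shuffled, slightly perturbed copy of A.
rng = np.random.default_rng(0)
d, h, k = 6, 16, 3
params_a = (rng.normal(size=(d, h)), rng.normal(size=h), rng.normal(size=(h, k)))
shuffle = rng.permutation(h)
shuffled = (params_a[0][:, shuffle], params_a[1][shuffle], params_a[2][shuffle, :])
params_b = tuple(p + 0.01 * rng.normal(size=p.shape) for p in shuffled)

aligned_b = align_hidden_units(params_a, params_b)
x = rng.normal(size=(32, d))
print(distance(params_a, params_b), distance(params_a, aligned_b))
```

The permutation leaves B's function unchanged while shortening the chord to A. Maximizing the summed inner products is equivalent to minimizing the Euclidean distance between the two parameter vectors, which is why a shorter, less curved chord tends to carry a smaller $E$.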

Width and the Near-Linear Regime

As width grows, $E$ shrinks relative to chord length and the normalized barrier $B_{\mathrm{lin}}/L_2^2$ decreases monotonically by 14× from width 64 to 4096.

Width sweep showing barrier normalization

Figure 5. CIFAR-10 MLP width sweep ($W\in\{64, \ldots, 4096\}$). (A) Raw $B_{\mathrm{lin}}$ is non-monotone because wider networks travel farther in parameter space. (B) Normalized $B_{\mathrm{lin}}/L_2^2$ decreases monotonically by 14×. (C) Chord-mean $\|E\|_\infty/L_2$ decreases similarly.
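The normalization in panel (B) can be sketched as follows, assuming $L_2$ denotes the Euclidean length of the chord, $\|\theta_B-\theta_A\|_2$. This page does not define $L_2$ explicitly, so that reading and the helper names are assumptions.

```python
import numpy as np

def chord_length(theta_a, theta_b):
    # Assumed meaning of L_2: Euclidean distance between flattened endpoints.
    return float(np.linalg.norm(theta_b - theta_a))

def normalized_barrier(barrier, theta_a, theta_b):
    # Barrier per unit squared chord length, B_lin / L_2^2.
    return barrier / chord_length(theta_a, theta_b) ** 2

# Toy endpoints with chord length 5 (a 3-4-5 triangle).
theta_a = np.zeros(4)
theta_b = np.array([3.0, 4.0, 0.0, 0.0])
print(normalized_barrier(2.0, theta_a, theta_b))      # 2 / 25
print(normalized_barrier(2.0, theta_a, 2 * theta_b))  # same raw barrier, chord doubled
```

Doubling the chord length at a fixed raw barrier divides the normalized value by four, which illustrates how raw $B_{\mathrm{lin}}$ can be non-monotone in width while the normalized quantity still falls.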

Barrier Attribution

The decomposition holds per-example, enabling barrier attribution: which inputs, classes, or subpopulations account for the merge gap?

Figure 6. Barrier attribution on a CIFAR-10 MLP pair. (A) Lorenz curve: the top 35% of examples account for 50% of the raw barrier; after alignment the residual is more concentrated. (B) Slope chart: class contributions reshuffle—“car” drops from 15% to 5%, “cat” rises from 10% to 20%. (C) Per-example scatter showing alignment reshuffles rather than uniformly compresses.
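Given per-example contributions $b_i = D_i - J_i - (E_y)_i$, the concentration and per-class statistics reduce to a sort and a weighted count. The sketch below uses made-up contributions and labels, and the function names are hypothetical; it assumes the total barrier is positive.

```python
import numpy as np

def share_needed(contrib, target=0.5):
    # Smallest fraction of examples, taken in decreasing order of
    # contribution, whose contributions reach `target` of the total.
    ordered = np.sort(contrib)[::-1]
    cum = np.cumsum(ordered) / ordered.sum()
    first = int(np.argmax(cum >= target))  # first index reaching the target
    return (first + 1) / len(contrib)

def class_shares(contrib, y, num_classes):
    # Fraction of the total barrier attributed to each true class.
    totals = np.bincount(y, weights=contrib, minlength=num_classes)
    return totals / contrib.sum()

# Made-up per-example contributions and labels, for illustration only.
contrib = np.array([4.0, 3.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = np.array([0, 1, 0, 1, 2, 2, 2, 2, 2, 2])

print(share_needed(contrib))      # 0.2: two of ten examples carry half the total
print(class_shares(contrib, y, 3))
```

Running the same two functions on the raw and the aligned pair would give the before and after numbers that a Lorenz curve and a slope chart summarize.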

~35%

Examples contributing
50% of barrier

3×

Class contribution
ratio (max / min)

ρ = 1.00

3-pass midpoint score
ranks pairs perfectly

Citation

@inproceedings{bae2026lmc,
  title     = {Linear mode connectivity barriers are logit linearization gaps},
  author    = {Bae, Hanbin},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2026}
}
