Sleep & Wellness Guide

Surrogate Fidelity: When Can Open LLMs Explain Closed Ones?

2026-06-30

Key Takeaway

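A research paper on when measurements made on open LLMs can support claims about closed ones. Its central finding is that prediction fidelity substantially overstates attribution fidelity: models that agree on the answer often disagree on why.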

Article Summary

Mechanistic interpretability (MI) requires full access to model internals, yet the APIs for most widely deployed language models at best expose log-probabilities over output tokens. This creates a surrogate problem: when do measurements made on open models allow us to make claims about a closed model? We evaluate surrogate fidelity at the prediction, attribution, and representation levels. For binary classification tasks, log-odds provide an API-compatible scalar readout of the model's representation space, and leave-one-out attributions provide insight into model behavior. Across eleven models spanning four families (Llama, Qwen, GPT, and Gemini), we find that prediction fidelity substantially overstates attribution fidelity: models that agree on what the answer is often disagree on why. We document an access-validity inversion: white-box signals like attention patterns and perturbation magnitudes are highly stable across models but only weakly predictive of causal attributions, which black-box input ablations capture by design. Mechanistic insight does not automatically transfer to closed targets, and prediction-level agreement is insufficient to warrant such transfer. Code and results are available at https://github.com/facebookresearch/surrogate.

Practicality: 5.0
Scientific Evidence: 7.0
Effectiveness: 4.0

