
Interpretability & Explainability

Opening the black box — understanding why a model made a prediction, and what it has actually learned inside.

Interpretability & Explainability is one of the core areas in the AI University map of AI. Explore the diagram, then dive into each topic — every subtopic grows into its own deep-dive over time.

flowchart LR
  IN[/Input/] --> MODEL[[Neural network]] --> PRED[/Prediction/]
  MODEL -. attributions .-> WHY[Why this output?]
  MODEL -. circuits .-> WHAT[What did it learn?]

Key topics

  • Feature attribution


    Which inputs mattered? SHAP, LIME, integrated gradients, and saliency maps.

  • Probing representations


    Testing what information is encoded in a model's internal activations, typically by training a simple classifier (a probe) to predict a property from them.

  • Mechanistic interpretability


    Reverse-engineering circuits and features inside networks — induction heads, superposition, sparse autoencoders.

  • Concept-based explanations


    Explaining models in terms of human-understandable concepts.

  • Global vs local


    Global explanations describe a model's overall behavior; local explanations account for a single prediction.

  • Faithfulness


    The hard question of whether an explanation reflects the true reason for a decision.
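
As a concrete illustration of feature attribution, here is a minimal sketch of integrated gradients on a toy logistic model. Everything in it is an assumption made for the example: the weights `w`, the bias `b`, the all-zeros baseline, and the helper names are hypothetical, and the gradient is written out analytically instead of coming from an autodiff library. The final check touches on faithfulness in a narrow sense: the attributions should sum to the change in model output between the baseline and the input.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, w, b):
    """Toy logistic model (hypothetical): probability of the positive class."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def model_grad(x, w, b):
    """Analytic gradient of the model output with respect to the input x."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]

def integrated_gradients(x, baseline, w, b, steps=200):
    """Approximate integrated gradients with a midpoint Riemann sum
    along the straight line from the baseline to x."""
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        grad = model_grad(point, w, b)
        for i in range(n):
            avg_grad[i] += grad[i] / steps
    # Scale the path-averaged gradient by the input's offset from the baseline.
    return [(xi - bi) * gi for xi, bi, gi in zip(x, baseline, avg_grad)]

# Assumed toy parameters: the third feature has zero weight, so it cannot matter.
w = [2.0, -1.0, 0.0]
b = 0.1
x = [1.0, 0.5, 3.0]
baseline = [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, w, b)
output_change = model(x, w, b) - model(baseline, w, b)

print("attributions:", attributions)
print("sum of attributions:", sum(attributions))
print("f(x) - f(baseline):", output_change)
```

In this sketch the first feature receives a positive attribution, the second a negative one, and the zero-weight third feature receives none. The two printed totals should agree closely. That agreement says only that the attributions account for the output change relative to the chosen baseline; it does not by itself show that the explanation reflects the true reason for a decision in a real network.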

Deep Learning · AI Safety, Alignment & Ethics · AI Ethics & Governance

