Linear Probe Interpretability in Python
The need for interpretability grows as machine learning impacts more areas of society, and understanding AI systems' inner workings is critical for ensuring value alignment and safety. Linear probes are an interpretability tool used to analyze what information is represented in the internal activations of a neural network. They are simple, independently trained linear classifiers attached to intermediate layers to gauge the linear separability of features; for example, they have been used to read the board state out of the internal activations of the Othello-GPT model discussed below.

Probing classifiers have emerged as one of the prominent methodologies for interpreting and analyzing deep neural network models of natural language processing. The basic idea is simple: a classifier is trained to predict some linguistic property from a model's representations, and the technique has been used to examine a wide variety of models and properties. Simple probes have shown language models to contain information about syntactic features like part-of-speech tags, and more complex probes have shown models to contain entire parse trees of sentences. Good probing performance hints at the presence of the probed property, which could even be used when choosing a label in the final layer of the network. A common technique for identifying feature directions is training linear probes with logistic regression (LR; Alain & Bengio, 2018). Linear probes were originally introduced in the context of image models but have since been widely applied to language models, including in explicitly safety-relevant applications such as measurement tampering. In some cases, however, the direction identified by LR can fail to reflect an intuitive best guess for the feature direction, even in the absence of confounding features.

Linear probes sit within mechanistic interpretability (often abbreviated as mech interp, mechinterp, or MI), a subfield of research within explainable artificial intelligence that aims to understand the internal workings of neural networks by analyzing the mechanisms present in their computations: reverse-engineering the computational mechanisms and representations learned by neural networks into human-understandable algorithms and concepts to provide a granular, causal understanding, with particular relevance to AI safety. Recently, MI has garnered significant attention for interpreting transformer-based language models (LMs), resulting in many novel insights yet introducing new challenges. One evaluation (Nov 7, 2024) compares Logit Lens, Tuned Lens, sparse autoencoders, and linear probes on GPT2-small, Gemma2-2b, and Llama2-7b against simpler but uninterpretable baselines of steering vectors and prompting; its results show that while existing methods allow for intervention, they are inconsistent across features and models.

Linear probes also attract well-known critiques. Neel Nanda notes that with a probe you design what feature you are looking for, rather than getting the chance to find features from a model-first perspective, and a related concern is that the probe itself does computation on top of the representation, even if that computation is just linear.
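As a concrete illustration of the probing-classifier recipe, here is a minimal sketch that trains a logistic-regression probe on a batch of activations. It assumes the activations have already been collected as an (n_samples, d_model) array; the synthetic data and variable names are placeholders rather than code from any of the repositories mentioned here.

```python
# Minimal probing-classifier sketch (illustrative; not taken from any specific repo).
# `activations` stands in for hidden states collected from one layer of a model,
# and `labels` marks the property being probed for.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
d_model = 512
activations = rng.normal(size=(2000, d_model))                               # placeholder activations
labels = (activations[:, 7] + 0.1 * rng.normal(size=2000) > 0).astype(int)   # toy "feature"

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=1000)   # a linear probe trained with logistic regression
probe.fit(X_train, y_train)

print("probe accuracy:", accuracy_score(y_test, probe.predict(X_test)))
# probe.coef_[0] is the learned feature direction in activation space.
```

If the probe generalizes well to held-out activations, that is evidence (though not proof) that the property is linearly represented at that layer.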
In practice, adding a simple linear classifier to intermediate layers can reveal the encoded information and the features critical for various tasks; this page covers the construction and utilization of linear probes, the insights gained from them, and their limitations and challenges. Probes reveal how semantic content evolves across network depths, providing actionable insights for model interpretability and performance assessment, and by isolating layer-specific diagnostics they can inform strategies for pruning and compression. Two common tests built on this idea are linear probes (a kind of "lie detector" that checks whether concepts are linearly encoded in activation space) and causal abstraction (which checks whether models use the expected causal or logical structures internally).

Linear probes have found concrete applications. Contextual hallucinations, statements unsupported by the given context, remain a significant challenge in AI. One line of work (Jul 31, 2025) demonstrates a practical and actionable application of interpretability insights through a generator-agnostic observer paradigm: a linear probe on a transformer's residual-stream activations identifies contextual hallucinations in a single forward pass, achieving high F1 scores. The probe isolates a single, transferable linear direction separating hallucinated from non-hallucinated content.

Another application is chess. The chess_llm_interpretability project evaluates LLMs trained on PGN-format chess games through the use of linear probes: it can train, evaluate, and visualize linear probes on LLMs that have been trained to play chess with PGN strings, checking the model's internal understanding of the board state and its ability to estimate the skill level of the players involved. The train_test_chess.py script can be used either to train new linear probes or to test a saved probe on the test set.

Sparse autoencoders (SAEs) were created for interpretability research, and some of the most interesting applications of SAE probes are in providing interpretability insight. To assess the interpretability of linear probes trained on SAE latents, one can analyze the distribution of their coefficients and manually inspect the latents with the largest coefficients, for example in the InterProt visualizer (Feb 6, 2025).
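The observer paradigm above reduces to reading one layer of the residual stream and scoring it along a learned direction. The sketch below shows that pattern using the Hugging Face transformers library; the model name, the layer index, and the (random) probe weights are stand-ins, not the released probe from the paper.

```python
# Sketch of an "observer"-style probe: score one layer's hidden states along a
# probe direction in a single forward pass. Model, layer, and weights are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"          # stand-in observer model
layer = 6                    # which hidden layer to read (assumption)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

probe_w = torch.randn(model.config.hidden_size)   # would be a trained probe direction
probe_b = torch.tensor(0.0)

text = "The report says revenue fell, and the summary claims revenue doubled."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

hidden = out.hidden_states[layer][0]                # (seq_len, hidden_size) at this layer
scores = torch.sigmoid(hidden @ probe_w + probe_b)  # per-token score along the probe direction
print(scores)
```

A real detector would train probe_w and probe_b on labeled hallucinated vs. grounded spans and then reuse the same direction across generators.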
The best-known case study for linear probes is Othello-GPT. In that research (Oct 12, 2023), the emergent world model principle is scrutinized in a series of simple transformer models trained to play Othello and termed Othello-GPT; these trained models exhibit proficiency in legal move execution. Trained linear probes can accurately map the activation vectors of a GPT model, pre-trained to play legal moves in the game of Othello, to the current state of the board. By training simple linear classifiers to predict specific game features from the network's representations, we can understand what the model "knows" at each layer and how this information evolves. Using a linear probe is in contrast to using a non-linear MLP as the probe: non-linear probes have been alleged to do part of the computation themselves, and that is why a linear probe is entrusted with this task.

Probe outputs can also be visualized and intervened on. In the chess setting, for example, we can visualize where the model "thinks" the white pawns are: on the left, the actual white pawn locations; in the middle, the probe outputs clipped to turn the heatmap into a more binary visualization; on the right, the full gradient of the model's probe outputs, without clipping. Each facet relates to a single column vector k from the probe, which classifies color and piece type simultaneously, and one can compute the cosine similarity between the column vectors of the main probe weight matrix and the linear combination of column vectors from separate color and piece probes. We can also perform interventions on the model's internal board state by deleting pieces from its internal world model.

The turn-based structure of these games matters for probing. As Neel Nanda put it (Mar 28, 2023): "Omg idea! Maybe linear probes suck because it's turn based - internal repns don't actually care about white or black, but training the probe across game moves breaks things in a way that needs smth non-linear to patch." The right next step is to validate the hypothesis properly and look at a bunch more neurons; a failure to do either can result in a lot of time being confused, going down rabbit holes, and pretty serious consequences from the model not being interpreted correctly. A non-linear probe can be trained for comparison: for example, to train a nonlinear probe with hidden size 64 on internal representations extracted from layer 6 of the Othello-GPT trained on the championship dataset, we can use the command python train_probe_othello.py --layer 6 --twolayer --mid_dim 64 --championship.
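For the non-linear comparison, a "two-layer" probe in the spirit of the --twolayer --mid_dim 64 option might look like the sketch below; the class name, tensor shapes, and three-way square labels are assumptions, not the actual train_probe_othello.py implementation.

```python
# Sketch of a two-layer ("nonlinear") probe with a hidden width of 64.
# Shapes and the 3-class target (e.g. empty / own piece / opponent piece for one
# board square) are illustrative assumptions.
import torch
import torch.nn as nn

class TwoLayerProbe(nn.Module):
    def __init__(self, d_model: int, mid_dim: int = 64, n_classes: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, mid_dim),
            nn.ReLU(),
            nn.Linear(mid_dim, n_classes),
        )

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        return self.net(acts)

probe = TwoLayerProbe(d_model=512)
acts = torch.randn(32, 512)            # a batch of layer activations (placeholder)
targets = torch.randint(0, 3, (32,))   # placeholder square labels
loss = nn.functional.cross_entropy(probe(acts), targets)
loss.backward()
print("loss:", loss.item())
```

Dropping the hidden layer (a single nn.Linear) recovers the linear probe, which is the variant whose success or failure is easiest to interpret.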
Beyond hand-rolled probes, Python has a broad ecosystem of model interpretability tooling. A Jan 14, 2021 post surveys the top five Python ML model interpretability libraries, drawn from a Best of Machine Learning with Python list in which all libraries are automatically ranked by a quality score based on metrics such as GitHub stars, code activity, and license. InterpretML is an open-source package that incorporates state-of-the-art machine learning interpretability techniques under one roof: with it you can train interpretable glassbox models and explain blackbox systems, understand your model's global behavior or the reasons behind individual predictions, debug models, explain predictions, and enable auditing to meet compliance with regulatory requirements. AIX360, the AI Explainability 360 toolkit (Jan 31, 2023), is an open-source library that supports the interpretability and explainability of datasets and machine learning models, with a comprehensive set of algorithms covering different dimensions of explanations along with proxy explainability metrics. The lime package provides local interpretable model-agnostic explanations, and SHAP offers local and global explanations, visualizations, and best practices for tree-based, linear, and deep learning models. Inherently interpretable models remain useful as well: linear regression, Lasso, and decision trees can be interpreted directly through feature importances and per-decision explanations, at both the global and the individual-prediction level (Mar 6, 2022), and sklearn makes it straightforward to build linear and multiple regression models. For surrogate approaches, the current R and Python implementations let you choose linear regression as an interpretable surrogate model; you select K, the number of features in the interpretable model, in advance, where a lower K is easier to interpret and a higher K potentially produces a surrogate with higher fidelity.

For probing specifically, Probity is a toolkit for interpretability research on neural networks with a focus on analyzing internal representations through linear probing; it provides a comprehensive suite of tools for creating and managing datasets for probing experiments, collecting and storing model activations, and training various types of probes (linear, logistic, PCA-based, and others). Related repositories include kikaymusic/Neuron-Level-Interpretability-and-Robustness-in-LLMs (a Python library that encapsulates various methods for neuron interpretation and analysis in deep NLP models), goodfire-ai/scribe-task-suite (a suite of interpretability tasks to evaluate agents using Scribe for notebook access), bigsnarfdude/borescope (AI safety mechinterp experiments), bergson (mapping out the "memory" of neural nets with data attribution and influence functions, in the spirit of Self-Influence Guided Data Reweighting for Language Model Pre-training), BUTTER-Clarifier (a Python package of neural network interpretability techniques plus a Keras callback that computes and captures the associated metrics during training), the official code repository for "Multilingual Safety Mechanistic Interpretability: A Comprehensive Analysis Across 10 Languages and 3 Models" (ICML 2026 submission), and curated collections of academic and industry papers on LLM interpretability.
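To make the surrogate idea concrete, the sketch below fits a black-box model, trains a simple linear surrogate to mimic its predictions, and keeps only the K largest-magnitude coefficients as the explanation; the dataset, the value of K, and the model choices are illustrative assumptions rather than the API of any particular library above.

```python
# Global surrogate sketch: approximate a black-box classifier with a linear model,
# then report the K most influential features. All choices here are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1500, n_features=20, n_informative=5, random_state=0)

black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
bb_pred = black_box.predict(X)                  # the surrogate is trained to mimic these

surrogate = LogisticRegression(max_iter=1000).fit(X, bb_pred)

K = 5                                           # number of features kept in the explanation
coef = surrogate.coef_[0]
top_k = np.argsort(np.abs(coef))[::-1][:K]
print("fidelity (agreement with black box):", surrogate.score(X, bb_pred))
print("top-K features:", top_k.tolist(), "weights:", np.round(coef[top_k], 3).tolist())
```

Raising K generally raises fidelity to the black box at the cost of a harder-to-read explanation, which is the trade-off described above.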
Whatever the tooling, training a probe yourself is less mechanistic interpretability and more standard machine learning, but it's still important to understand how to do it if you want to run this kind of analysis. Start with the necessary packages: numpy is the main package for scientific computing with Python and is often imported with the np shortcut; matplotlib is a library to plot graphs in Python; os provides a portable way of using operating-system-dependent functionality, e.g., modifying files and folders; cv2 is a leading computer vision library; and torch is a deep learning framework. More broadly, Python enables data professionals to address concerns about fairness and accountability in AI systems.

Several practical caveats apply. Checking model assumptions is like commenting code: everybody should be doing it often, but it sometimes ends up being overlooked, and a failure to do either can result in a lot of time being confused, going down rabbit holes, and pretty serious consequences from the model not being interpreted correctly. We are not totally confident that probes measure their associated concept: interpretability is known to have illusion issues, and linear probing is not an exception; "Interpretability Illusions in the Generalization of Simplified Models" shows how interpretability methods based on simplified models (e.g., linear probes) can be prone to generalisation illusions. Probes can also be designed with varying levels of complexity: while linear probes are simple and interpretable, they are unable to disentangle distributed features that combine in a non-linear way, and in the future it would be interesting to use non-linear probes, such as decision trees or simple neural nets, to disambiguate non-linear concepts. Finally, class balance matters: we can also test the setting where we have imbalanced classes in the training data but balanced classes in the test set.
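The imbalanced-training, balanced-test setting can be simulated directly: train the probe on skewed labels, evaluate on a balanced held-out set, and compare unweighted and class-weighted training. The data, split sizes, and weighting choice below are illustrative assumptions.

```python
# Sketch: a probe trained on imbalanced classes, evaluated on a balanced test set.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)
d_model = 256

def make_split(n_pos, n_neg):
    # Toy "activations": the two classes differ along one direction plus noise.
    direction = np.zeros(d_model)
    direction[0] = 1.0
    pos = rng.normal(size=(n_pos, d_model)) + 1.5 * direction
    neg = rng.normal(size=(n_neg, d_model)) - 1.5 * direction
    X = np.vstack([pos, neg])
    y = np.array([1] * n_pos + [0] * n_neg)
    return X, y

X_train, y_train = make_split(n_pos=100, n_neg=1900)   # imbalanced training data
X_test, y_test = make_split(n_pos=500, n_neg=500)      # balanced test set

for weighting in (None, "balanced"):
    probe = LogisticRegression(max_iter=1000, class_weight=weighting).fit(X_train, y_train)
    acc = balanced_accuracy_score(y_test, probe.predict(X_test))
    print(f"class_weight={weighting}: balanced test accuracy = {acc:.3f}")
```

Reporting balanced accuracy (or F1) on the balanced test set makes it harder for a probe to look good simply by predicting the majority class.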
Concretely, the Othello-GPT documentation covers the game logic, the training process, the mechanistic interpretability tools, the linear probing system (probe architecture and probe training), intervention experiments (intervention methods and attribution via intervention), data management (dataset structure and board state representation), experiments and results, circuit analysis workflows, and visualization techniques. The scripts to train the Othello models and linear probes are contained in train_gpt_othello.ipynb and train_linear_probes_gpt.ipynb; the figures displaying training losses for probes and models are made in view_trained_GPT.ipynb and view_trained_probes.ipynb; and the intervention studies shown in the pdf are conducted in view_causality_complex.ipynb and view_causality_single_tile. To visualise probe outputs or better understand the work, check out probe_output_visualization.ipynb, which has commentary and many print statements to walk you through using a single probe and performing a single intervention; train_probe_othello.py is then used to train probes. A typical probing repository keeps the same separation of concerns, e.g. a src/extraction/ directory for hidden-state extraction and a src/probing/ directory for the linear probing code. For tutorials and more information, visit the GitHub page.

The LaBo and linear-probe experiments follow a similar layout: main.py is the interface to run all experiments, utils.py contains the preprocess and feature extraction functions, data.py and data_lp.py are the dataloaders for LaBo and the linear probe respectively, labo_train.sh and labo_test.sh are the bash files to train and test LaBo, and a separate bash file runs the linear probe. Steering experiments build on trained probes in five steps: construct activations with concept ground truth, train linear probes, analyze activation projections on the linear probe directions, use concept-specific directions to perturb scGPT activations, and analyze the steering results.

A final note on terminology: "linear probing" is also the name of an unrelated technique used in hash tables to handle collisions. When a collision occurs (i.e., when two keys hash to the same index), linear probing searches for the next available slot in the hash table by incrementing the index until an empty slot is found. In the interpretability sense used throughout this page, a linear probe is always a linear classifier trained on a model's internal representations.
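For completeness, here is a tiny sketch of the hash-table sense of "linear probing" (open addressing with a linear scan). It is a toy table for illustration, not a production implementation, and it assumes the table never fills up.

```python
# Toy open-addressing hash table that resolves collisions with linear probing.
class LinearProbingTable:
    def __init__(self, capacity=8):
        self.keys = [None] * capacity
        self.values = [None] * capacity

    def _slot(self, key):
        # Start at the hashed index; on a collision, step forward one slot at a time.
        i = hash(key) % len(self.keys)
        while self.keys[i] is not None and self.keys[i] != key:
            i = (i + 1) % len(self.keys)
        return i

    def put(self, key, value):
        i = self._slot(key)
        self.keys[i], self.values[i] = key, value

    def get(self, key):
        i = self._slot(key)
        return self.values[i] if self.keys[i] == key else None

table = LinearProbingTable()
table.put("probe", 1)
table.put("lens", 2)
print(table.get("probe"), table.get("lens"))   # -> 1 2
```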