Implementing explainable AI with Random Forest and SHAP
A guide for researchers who have finished the Python Vibe Coding Course and want to take the next step — actually implementing machine learning. This page introduces a hands-on sample notebook that uses Random Forest and SHAP (Shapley values) on agricultural biochar data, and walks through how to run it on your own machine. The aim is to move from "writing code" to "doing machine learning for science."
The most important thing in learning machine learning is to understand the ideas before memorizing the syntax. An LLM can write code for you, but reading the results — separating signal from noise, understanding what the model is actually doing — only comes from books.
The two books below are both freely available, world-class online textbooks, and both relate directly to the sample notebook. Please read through them before running the code.
Supervised Machine Learning for Science
Reframes machine learning not as a "prediction tool" but as a scientific tool that integrates interpretability, causality, and uncertainty quantification. Lays out the basic mindset you should bring whenever you use ML in research.
Interpretable Machine Learning
A comprehensive textbook on techniques for making black-box models explainable. Without grasping the theoretical meaning of SHAP (Shapley values), the figures produced by the sample code will only feel "vaguely important." At minimum, read the SHAP chapter.
The sample notebook produces results in just a few minutes if you only want to "make it run." But that path leads only to "feeling like you've done machine learning" — not to actually doing it. If you plan to use this in research, please go through the two books first to ground yourself in the ideas.
The sample uses a dataset of biochar (charcoal-like soil amendment) properties from agricultural research. Biochar properties depend strongly on raw material, pyrolysis temperature, treatment time, and other production parameters. The notebook trains a Random Forest regression model from these production parameters and uses SHAP values to visualize which conditions matter, by how much, and in which direction.
1. Load the CSV data (biochar properties)
2. Define features and target
3. Train a Random Forest regression model
4. Compute SHAP values
5. Visualize the results (Summary plot, Dependence plot, etc.)
The entire workflow lives in a single Jupyter Notebook (.ipynb). You only need to run the cells from top to bottom.
This section assumes you have already finished Part 2 (Environment Setup) of the Python Vibe Coding Course. If you have not yet installed VSCode and Python or set up a virtual environment, please go through that part first.
Open the GitHub repository page, click the green "Code" button, and choose "Download ZIP." Unzip the archive somewhere reasonable on your machine (e.g., C:\dev\RandomForestSHAP).
Putting the folder under OneDrive or in a deeply nested path can trigger Windows' "path too long" error. See the supplement on the Windows Long Path problem if you run into trouble.
Launch VSCode and use "File → Open Folder" to open the unzipped folder. Then open a terminal (top menu: "Terminal → New Terminal") and create a virtual environment.
# Create the virtual environment
python -m venv .venv
# Activate (Windows / PowerShell)
.venv\Scripts\Activate.ps1
# Activate (Mac / Linux)
source .venv/bin/activate
With (.venv) showing at the start of your prompt, install the required libraries in one go:
pip install pandas numpy scikit-learn shap matplotlib seaborn jupyter openpyxl
Once installation finishes, write out the versions for reproducibility:
pip freeze > requirements.txt
From VSCode's left sidebar, double-click BiocharRandomForestSHAP.ipynb. In the top-right corner, click "Select Kernel" and choose your .venv environment. Then run each cell from top to bottom (click the ▶ button on the left of each cell, or press Shift + Enter).
Selecting the kernel works exactly the same way as in Part 2 / Step 7 of the Python course. If you forget to pick the virtual environment, you'll get "module not found" errors for all the libraries you just installed.
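If you are unsure whether the notebook is really running in .venv, a quick check like the following can be pasted into the first cell. This is a small sketch using only the standard library; the helper names in_virtualenv and missing_modules are made up for this example. Note that scikit-learn is imported under the name sklearn.

```python
import importlib.util
import sys

# Import names of the libraries installed above (scikit-learn imports as "sklearn").
REQUIRED = ["pandas", "numpy", "sklearn", "shap", "matplotlib", "seaborn", "openpyxl"]


def in_virtualenv() -> bool:
    """True when the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix


def missing_modules(names):
    """Return the names that cannot be found by the current interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
print("Missing modules:", missing_modules(REQUIRED) or "none")
```

If the interpreter path does not point inside the .venv folder, or the list of missing modules is not empty, re-select the kernel before running the rest of the notebook.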
Roughly speaking, a SHAP value tells you "how much a given feature pushed the prediction away from the average, and in which direction." For a first pass, three plot types are enough:
For the deeper interpretation — and important caveats like "high SHAP importance does not mean causal effect" — please refer to the SHAP chapter of Interpretable Machine Learning.
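To make the "pushed the prediction away from the average" idea concrete, the sketch below computes exact Shapley values by brute force for a tiny hand-written model. Everything in it is hypothetical: the predict function, its coefficients, and the single baseline point standing in for "the average." The SHAP library averages over background data and uses much faster algorithms for tree models, so treat this only as an illustration of the definition.

```python
from itertools import combinations
from math import factorial


def predict(temp, time, ash):
    """Hypothetical toy model with a temp x ash interaction (not the notebook's model)."""
    return 5.0 + 0.004 * temp + 0.01 * time + 0.0001 * temp * ash


def shapley_values(f, instance, baseline):
    """Exact Shapley values of f at `instance`, relative to a single `baseline` point."""
    names = list(instance)
    n = len(names)

    def value(subset):
        # Features in `subset` take the instance's values; the rest stay at baseline.
        args = {k: (instance[k] if k in subset else baseline[k]) for k in names}
        return f(**args)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_feature = value(set(subset) | {name})
                without_feature = value(set(subset))
                total += weight * (with_feature - without_feature)
        phi[name] = total
    return phi


baseline = {"temp": 500.0, "time": 60.0, "ash": 10.0}
instance = {"temp": 650.0, "time": 30.0, "ash": 15.0}

phi = shapley_values(predict, instance, baseline)
gap = predict(**instance) - predict(**baseline)

for name, contribution in phi.items():
    print(f"{name}: {contribution:+.4f}")
print(f"sum of contributions: {sum(phi.values()):+.4f}")
print(f"prediction - baseline: {gap:+.4f}")
```

The contributions add up to the gap between this prediction and the baseline prediction, and each one carries a sign. The interaction term is shared between temp and ash, which is one reason a SHAP value describes the model's behavior rather than a causal effect.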
Once the sample runs, you'll naturally want to try it on your own data. In most cases you only need to change the data-loading section and the feature/target column names. Concretely, look at three places:
Data loading (the pd.read_csv(...) argument): change the file path to point at your CSV.
Feature definition (X = df[[...]]): replace with your own column names.
Target definition (y = df[...]): replace with the column you want to predict.
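The three edits described above might look like this in practice. This is a sketch, not the notebook's code: the helper load_xy, the column names, and the inline CSV text (a stand-in for your own file so the example runs anywhere) are all assumptions. Dropping rows with missing values is also just one possible choice.

```python
import io

import pandas as pd


def load_xy(source, feature_cols, target_col):
    """Read a CSV and return (X, y), failing early if a column name is wrong."""
    df = pd.read_csv(source)
    wanted = feature_cols + [target_col]
    missing = [col for col in wanted if col not in df.columns]
    if missing:
        raise KeyError(f"Columns not found in the CSV: {missing}")
    # Drop rows with missing values in the columns we use (one simple option).
    df = df.dropna(subset=wanted)
    return df[feature_cols], df[target_col]


# Stand-in for your own file; in real use pass a path such as "my_data.csv".
csv_text = io.StringIO(
    "temp_C,time_min,ash_pct,yield_pct\n"
    "400,30,5.1,35.2\n"
    "500,60,,30.1\n"
    "550,45,8.3,28.7\n"
    "600,90,9.0,26.4\n"
    "700,120,12.5,22.9\n"
)

feature_cols = ["temp_C", "time_min", "ash_pct"]  # hypothetical feature columns
target_col = "yield_pct"                          # hypothetical target column

X, y = load_xy(csv_text, feature_cols, target_col)
print(X.shape, y.shape)
```

Checking the column names up front gives a readable error message instead of a failure several cells later.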
"Please modify the code in BiocharRandomForestSHAP.ipynb to work with my own dataset my_data.csv. The columns in my data are: [list of columns]. The target variable I want to predict is [column name]."
If your dataset is small (a few dozen to a few hundred rows), watch out for overfitting with Random Forest. For cross-validation and hyperparameter tuning, refer back to the relevant chapters of Supervised Machine Learning for Science.
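One simple way to see overfitting on a small dataset is to compare the score on the training data with a cross-validated score. The sketch below uses scikit-learn's KFold and cross_val_score on synthetic data. The dataset size, the number of folds, and the data itself are assumptions for illustration, not recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Small synthetic dataset: 60 rows, 4 features, only the first one matters.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(60, 4))
y = 3.0 * X[:, 0] + rng.normal(0, 0.3, 60)

model = RandomForestRegressor(n_estimators=200, random_state=0)

# Score on the same data the model was trained on (optimistic by construction).
model.fit(X, y)
train_r2 = model.score(X, y)

# 5-fold cross-validation: each fold is scored on rows the model has not seen.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_r2 = cross_val_score(model, X, y, cv=cv, scoring="r2")

print(f"Training R^2:        {train_r2:.2f}")
print(f"Cross-validated R^2: {cv_r2.mean():.2f} (+/- {cv_r2.std():.2f})")
```

A training score well above the cross-validated score is a sign that the model has memorized part of the noise. SHAP values computed from such a model describe that memorized noise too.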
Questions and suggestions are welcome via Contact (tomoakiyamaguchirice@gmail.com) or via GitHub Issues.
Running the code and seeing colorful plots can feel like understanding. But if you can't explain what each plot means in your own words, it isn't research — it's decoration. Machine learning and its interpretation methods, used without grasping their theoretical foundations, lead directly to wrong conclusions.
If you intend to use these methods in real research, read the two textbooks introduced at the top of this page, and read them repeatedly. Not once and done. Keep returning to them as you write code and look at results. In an era where LLMs can write the code, the researcher's job has shifted to "interpreting the results correctly" — and that capability comes only from steady time spent with proper textbooks.
The gap between "I made it run" and "I'm using this in research" is wider than it looks. Don't put off learning the underlying theory; otherwise the tool will be using you, not the other way around.