First Steps in Machine Learning — Random Forest and SHAP

A guide for researchers who have finished the Python Vibe Coding Course and want to take the next step — actually implementing machine learning. This page introduces a hands-on sample notebook that uses Random Forest and SHAP (Shapley values) on agricultural biochar data, and walks through how to run it on your own machine. The aim is to move from "writing code" to "doing machine learning for science."

! Required reading: two books to start with

The most important thing in learning machine learning is to understand the ideas before memorizing the syntax. An LLM can write code for you, but reading the results — separating signal from noise, understanding what the model is actually doing — only comes from books.

The two books below are both freely available, world-class online textbooks, and both relate directly to the sample notebook. Please read through them before running the code.

Book 01 — Machine Learning for Science

Supervised Machine Learning for Science

Christoph Molnar & Timo Freiesleben (CC BY-NC-SA 4.0)

Reframes machine learning not as a "prediction tool" but as a scientific tool that integrates interpretability, causality, and uncertainty quantification. Lays out the basic mindset you should bring whenever you use ML in research.

Book 02 — Explaining black-box models

Interpretable Machine Learning

Christoph Molnar (3rd edition, 2025)

A comprehensive textbook on techniques for making black-box models explainable. Without grasping the theoretical meaning of SHAP (Shapley values), the figures produced by the sample code will only feel "vaguely important." At minimum, read the SHAP chapter.

The sample notebook produces results in just a few minutes if you only want to "make it run." But that path leads only to "feeling like you've done machine learning" — not to actually doing it. If you plan to use this in research, please go through the two books first to ground yourself in the ideas.

01 What the sample code does

The sample uses a dataset of biochar (charcoal-like soil amendment) properties from agricultural research. Biochar properties depend strongly on production parameters such as the raw material, pyrolysis temperature, and treatment time. The notebook trains a Random Forest regression model on these production parameters and uses SHAP values to visualize which conditions matter, by how much, and in which direction.

Pipeline at a glance

1. Load the CSV data (biochar properties)
2. Define features and target
3. Train a Random Forest regression model
4. Compute SHAP values
5. Visualize the results (Summary plot, Dependence plot, etc.)

The entire workflow lives in a single Jupyter Notebook (.ipynb). You only need to run the cells from top to bottom.

View the repository on GitHub ↗

02 Run it on your local machine

This section assumes you have already finished Part 2 (Environment Setup) of the Python Vibe Coding Course. If you have not yet installed VSCode and Python, or have not yet set up a virtual environment, please go through that part first.

A. Download the repository

Open the GitHub repository page, click the green "Code" button, and choose "Download ZIP." Unzip the archive somewhere reasonable on your machine (e.g., C:\dev\RandomForestSHAP).

Putting the folder under OneDrive or in a deeply nested path can trigger Windows' "path too long" error. See the supplement on the Windows Long Path problem if you run into trouble.
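If you want to check in advance whether a folder is at risk, a short script can report the longest path it contains. This is a sketch only: the helper name longest_path is made up for this page, and the threshold used is the classic Windows MAX_PATH limit of 260 characters.

```python
import os
from pathlib import Path

MAX_PATH = 260  # classic Windows MAX_PATH limit, in characters


def longest_path(root):
    """Return the longest absolute path (as a string) found under root."""
    root = Path(root).resolve()
    longest = str(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            candidate = os.path.join(dirpath, name)
            if len(candidate) > len(longest):
                longest = candidate
    return longest


# Check the current folder; pass the unzipped repository folder instead.
worst = longest_path(".")
print(f"{len(worst)} characters: {worst}")
if len(worst) >= MAX_PATH:
    print("At or over the classic limit; consider a shorter location.")
```

Run it from the unzipped folder. Keep in mind that creating the virtual environment adds many deeply nested files under .venv, so a folder that is only slightly under the limit now may still cause trouble later.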

B. Open the folder in VSCode and create a virtual environment

Launch VSCode and use "File → Open Folder" to open the unzipped folder. Then open a terminal (top menu: "Terminal → New Terminal") and create a virtual environment.

Terminal
# Create the virtual environment
# (on Mac / Linux the command may be python3 instead of python)
python -m venv .venv

# Activate (Windows / PowerShell)
.venv\Scripts\Activate.ps1

# Activate (Mac / Linux)
source .venv/bin/activate

C. Install the libraries

With (.venv) showing at the start of your prompt, install the required libraries in one go:

Terminal (inside the virtual environment)
pip install pandas numpy scikit-learn shap matplotlib seaborn jupyter openpyxl

Once installation finishes, record the exact installed versions so the environment can be reproduced later:

Terminal
pip freeze > requirements.txt

D. Open the notebook and run it

From VSCode's left sidebar, double-click BiocharRandomForestSHAP.ipynb. In the top-right corner, click "Select Kernel" and choose your .venv environment. Then run each cell from top to bottom (click the ▶ button on the left of each cell, or press Shift + Enter).

Selecting the kernel works exactly the same way as in Part 2 / Step 7 of the Python course. If you forget to pick the virtual environment, the notebook runs on a different Python interpreter, and the import statements fail with ModuleNotFoundError for all the libraries you just installed.
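If you are unsure which interpreter the notebook is using, a quick check in the first cell can help. The sketch below uses only the standard library; the helper names in_virtualenv and missing_modules are invented for this page, and the list holds the import names of the libraries from step C (scikit-learn is imported as sklearn).

```python
import importlib.util
import sys

# Import names of the libraries installed in step C.
REQUIRED = ["pandas", "numpy", "sklearn", "shap", "matplotlib", "seaborn", "openpyxl"]


def in_virtualenv():
    """Return True when the running interpreter belongs to a venv."""
    return sys.prefix != sys.base_prefix


def missing_modules(names):
    """Return the names that the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
print("Missing modules:", missing_modules(REQUIRED) or "none")
```

If the interpreter path does not point inside the .venv folder, or if modules are reported missing, re-select the kernel before running the rest of the notebook.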

03 Reading the SHAP plots — the bare minimum

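Roughly speaking, a SHAP value tells you, for one sample, "how much a given feature pushed the prediction away from the average prediction, and in which direction." Summed over all features, the SHAP values account for the full gap between that sample's prediction and the average. For a first pass, three plot types are enough: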

Summary plot — Shows the overall importance of each feature, plus whether it tends to push predictions up or down.
Dependence plot — Plots a feature's value against its SHAP value. Reveals linear/non-linear relationships and interaction effects with other features.
Force plot / Waterfall plot — For a single sample, decomposes the prediction into the contribution of each feature.

For the deeper interpretation — and important caveats like "high SHAP importance does not mean causal effect" — please refer to the SHAP chapter of Interpretable Machine Learning.
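To make the "pushed the prediction away from the average" reading concrete, here is a brute-force Shapley calculation on a toy problem, using only the standard library. Everything in it is an assumption for illustration: toy_model is a hand-written stand-in for a trained model, the background rows are invented, and "feature absent" is modeled by averaging over the background rows. The shap library uses much faster algorithms for tree models; this sketch only shows the definition at work.

```python
from itertools import combinations
from math import factorial


def toy_model(row):
    """Hypothetical stand-in for a trained model with one interaction term."""
    temp, time, ash = row
    return 0.01 * temp + 0.5 * ash + 0.001 * temp * time


def coalition_value(model, x, background, subset):
    """Average prediction when the features in `subset` are fixed to x's
    values and the other features are taken from each background row."""
    total = 0.0
    for row in background:
        mixed = [x[j] if j in subset else row[j] for j in range(len(x))]
        total += model(mixed)
    return total / len(background)


def shapley_values(model, x, background):
    """Exact Shapley values by enumerating every coalition of features."""
    n = len(x)
    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = coalition_value(model, x, background, set(subset) | {i})
                without_i = coalition_value(model, x, background, set(subset))
                contribution += weight * (with_i - without_i)
        phi.append(contribution)
    return phi


background = [(300, 30, 5), (500, 60, 10), (700, 90, 20)]  # invented rows
x = (600, 45, 15)                                          # sample to explain

phi = shapley_values(toy_model, x, background)
baseline = coalition_value(toy_model, x, background, set())
print("Baseline (average prediction):", baseline)
print("SHAP values per feature:", phi)
print("Baseline + sum of SHAP values:", baseline + sum(phi))
print("Model prediction for x:", toy_model(x))
```

The last two printed numbers agree: the SHAP values split the gap between this sample's prediction and the average prediction among the features. Nothing in that bookkeeping says anything about cause and effect, which is why the caveat above matters.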

04 Adapting it to your own research data

Once the sample runs, you'll naturally want to try it on your own data. In most cases you only need to change the data-loading section and the feature/target column names. Concretely, look at three places:

The pd.read_csv(...) argument — change the file path to point at your CSV.
The list of feature columns (X = df[[...]]) — replace with your own column names.
The target column (y = df[...]) — replace with the column you want to predict.
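A common stumbling block when swapping in your own CSV is a column name that does not match exactly (a typo, a trailing space, different capitalization). The sketch below shows one way to check the names before training. The helper split_features_target and the columns temp_c, time_min, and yield_pct are hypothetical, and an in-memory string stands in for your file.

```python
import io

import pandas as pd


def split_features_target(df, feature_cols, target_col):
    """Return (X, y) after checking that every requested column exists."""
    missing = [c for c in [*feature_cols, target_col] if c not in df.columns]
    if missing:
        raise KeyError(
            f"Columns not found in the data: {missing}. "
            f"Available columns: {list(df.columns)}"
        )
    return df[feature_cols], df[target_col]


# Stand-in for pd.read_csv("my_data.csv"); the columns are invented.
csv_text = """temp_c,time_min,yield_pct
400,30,35.2
500,60,31.0
600,90,27.5
"""
df = pd.read_csv(io.StringIO(csv_text))

X, y = split_features_target(df, ["temp_c", "time_min"], "yield_pct")
print(X.shape, y.name)
```

With a real file, replace the io.StringIO(...) argument with the path to your CSV and pass your own column names. The error message lists the available columns, which usually makes the mismatch obvious.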

Example prompt for an LLM

"Please modify the code in BiocharRandomForestSHAP.ipynb to work with my own dataset my_data.csv. The columns in my data are: [list of columns]. The target variable I want to predict is [column name]."

If your dataset is small (a few dozen to a few hundred rows), watch out for overfitting with Random Forest. For cross-validation and hyperparameter tuning, refer back to the relevant chapters of Supervised Machine Learning for Science.
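One simple way to see overfitting on a small dataset is to compare the score on the training data with a cross-validated score. The sketch below uses synthetic data (60 rows, 5 features, only the first of which carries signal) purely as an assumption for illustration; the settings are not taken from the sample notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Small synthetic dataset: only feature 0 affects the target, plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(60, 5))
y = 2.0 * X[:, 0] + rng.normal(0, 0.5, size=60)

model = RandomForestRegressor(n_estimators=200, random_state=0)

# Score on the same rows the model was trained on (optimistic).
train_r2 = model.fit(X, y).score(X, y)

# Score on held-out folds (closer to performance on new data).
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_r2 = cross_val_score(model, X, y, cv=cv, scoring="r2")

print(f"R^2 on training data: {train_r2:.2f}")
print(f"R^2 under 5-fold CV:  {cv_r2.mean():.2f} +/- {cv_r2.std():.2f}")
```

A large gap between the two numbers is a warning sign. SHAP values explain whatever the model has learned, including noise it has memorized, so it is worth checking the cross-validated score before reading much into the plots.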

✓ What you can now do

You know the two essential books to read at the start of your ML journey
You can run the Random Forest + SHAP sample on your own machine
You understand the three core SHAP visualizations
You know the basic procedure to adapt the sample to your own data

Questions and suggestions are welcome via Contact (tomoakiyamaguchirice@gmail.com) or via GitHub Issues.

! One last word — using these methods without understanding them is meaningless

Running the code and seeing colorful plots can feel like understanding. But if you can't explain what each plot means in your own words, it isn't research — it's decoration. Machine learning and its interpretation methods, used without grasping their theoretical foundations, lead directly to wrong conclusions.

If you intend to use these methods in real research, read the two textbooks introduced at the top of this page, and read them repeatedly. Not once and done. Keep returning to them as you write code and look at results. In an era where LLMs can write the code, the researcher's job has shifted to "interpreting the results correctly" — and that capability comes only from steady time spent with proper textbooks.

The gap between "I made it run" and "I'm using this in research" is deeper than it looks. Don't put off learning the underlying theory — otherwise, you'll be the one being used by the tool, not the other way around.